Publications of the American and Canadian Committees
on Modern Languages


VOLUME ONE


New York
THE MACMILLAN COMPANY
1927


All rights reserved


NEW YORK EXPERIMENTS 
WITH 
NEW-TYPE MODERN LANGUAGE TESTS


Publications of the American and Canadian 
Committees on Modern Languages 


NEW YORK EXPERIMENTS WITH NEW-TYPE 
MODERN LANGUAGE TESTS 


By Ben D. Wood


A LABORATORY STUDY OF THE READING OF
MODERN FOREIGN LANGUAGES 


By G. T. Buswell


ENROLMENT IN MODERN FOREIGN 
LANGUAGES IN SECONDARY SCHOOLS AND 
COLLEGES IN THE UNITED STATES 


By ANDREW T. WYLIE 


A GRADED SPANISH WORD BOOK* 


Compiled by Milton A. Buchanan


* Published by The University of Toronto Press, Toronto, Canada 


New York Experiments with 
New-Type Modern Language Tests


INCLUDING 


A SURVEY OF MODERN LANGUAGE ACHIEVEMENT IN THE 
JUNIOR HIGH SCHOOLS OF NEW YORK CITY, JUNE, 1925;


THE REGENTS EXPERIMENT OF JUNE, 1925, WITH NEW- 
TYPE TESTS IN FRENCH, GERMAN, SPANISH, AND PHYSICS 


WITH A FOREWORD BY THE COMMISSIONER OF EDUCATION OF NEW YORK STATE 
AND 


A SECOND SURVEY OF MODERN LANGUAGE ACHIEVE- 
MENT IN THE JUNIOR HIGH SCHOOLS OF NEW YORK CITY 
JUNE, 1926


BY 


BEN D. WOOD 


DIRECTOR OF THE BUREAU OF COLLEGIATE EDUCATIONAL RESEARCH 
COLUMBIA COLLEGE 


New York
THE MACMILLAN COMPANY 
1927 


All rights reserved 


Copyright, 1927,


By THE MACMILLAN COMPANY. 


Set up and electrotyped. Published August, 1927. 




PRINTED IN THE UNITED STATES OF AMERICA 
BY BERWICK & SMITH CO. 


FOREWORD 


This monograph is the first of a series to be issued by the Modern Foreign
Language Study with the Canadian Committee on Modern Languages,
under the auspices of the American Council on Education. These organiza- 
tions, supported by a grant from the Carnegie Corporation of New York, 
have undertaken an investigation of the teaching and study of the modern 
foreign languages and the development of means for their improvement, 
particularly at the levels of instruction in secondary school and the earlier 
years in college. 

The present work consists of three studies, all based on the application of 
objective language tests. These tests have been constructed according to 
techniques tried out and validated during the last four years by the Col- 
umbia College Research Bureau for testing achievement in the modern 
languages. 

The first paper describes the results of applying these tests to all students 
of French and Spanish in the junior high schools of New York City at the 
final examinations in June, 1925. The author analyzes the tests themselves 
and presents the results indicated by their administration. 

The second study resulted from the administration of similar tests in 
French, Spanish, German, and Physics, in the June, 1925 examinations of 
the Regents of the State of New York, when the examination period was 
divided equally between new-type and old-type tests. Here, also, a careful 
analysis of the examinations accompanies a presentation of the results 
shown by correlations between the returns from the new-type and the 
old-type tests. 

The third study relates to a second survey of achievement in French 
and Spanish in the junior high schools of New York City. This survey was 
carried out in June, 1926, one year after the survey on which the first study 
was based, and the report of its results, published as Part Three in this 
volume, is a logical continuation of the first study. 

Professor Wood’s three studies open a field of inquiry of great importance 
and indicate results which are fundamental for modern language testing. 
The total number of returns involved — for the Regents experiment 31,025, 
and for the junior high school surveys approximately 50,000 — dwarfs by 
comparison every previous experiment with new-type tests in the modern 
languages. Furthermore, the field of administration offered unusually 
favorable features, presenting in each case groups of schools in a unified 
system under highly centralized modern language supervision, and, in the
case of the Regents experiment, employing an examination routine long in
use. All this made for pupil material as homogeneous as would be possible
with so large a number of American school children, and for conditions of 
administration as nearly identical as our educational groups can supply. 

Under these circumstances the possibilities with new-type tests have 
been given a stern tryout. They have been used to sound the situation as 
regards achievement and placement in the junior high school and, together 
with the old-type tests, to measure the same features in all the college- 
preparing schools of the State. Of the greatest importance is the oppor- 
tunity which the author has had for a searching analysis of the old and 
the new types of tests and a comparative study of their efficiency as tools 
of measurement. 

It is hoped that modern language teachers and school administrators, 
as well as experts in the field of educational theory, will give the reports on 
these experiments careful examination. It has been necessary in certain 
parts of the reports to use a terminology which may seem to some modern 
language teachers unfamiliar and technical. However, teachers of the 
foreign languages are coming to recognize the importance of recent educa- 
tional research for their work and are familiarizing themselves in ever- 
increasing numbers with the terminology which the use of statistical 
methods involves. To those readers who do not readily comprehend the 
more intricate discussions of correlations, probable errors, deviations, and 
the like, the following reports offer nevertheless in non-technical language 
a wealth of significant detail on a subject of paramount importance for the 
organization of modern language teaching. 

The first two studies were carried through quite independently of each 
other and the reports were prepared without thought of joint issue. This 
accounts for numerous repetitions in the text, particularly in explanation of 
test techniques, forms of tables, graphs, etc. The reader will, however, 
become convinced that the studies belong under one cover, since the results 
which they offer are secured by the same type of test and indicate the same 
weaknesses in the present system of grading and classifying modern lan- 
guage pupils. To understand the possibilities of tests of this charac- 
ter it is very helpful to be able to traverse the results derived from 
two fields where the new-type tests have been applied under different 
conditions. The third study was originally prepared for separate publi- 
cation, but since it became available before this volume got through the 
press, it is included because of its close relations to the first study.


PART ONE
SURVEY OF MODERN LANGUAGE ACHIEVEMENT
IN THE JUNIOR HIGH SCHOOLS OF
NEW YORK CITY


I 


INTRODUCTION 


At the request of Mr. Jacob Greenberg, Director of Foreign Languages
in the Junior High Schools of New York City, I undertook in the spring 
of 1925 to assist in the construction of tests which should afford objective, 
reliable, and comparable measures of achievement in French and in Spanish, 
and which should be adapted to the whole range of achievement in modern 
language classes in the junior high schools. It was agreed that two equiva- 
lent examinations (Forms A and B) should be constructed for each lan- 
guage; that, in accordance with the indications of recent experiments in 
modern language testing,¹ the examinations should attempt to measure
vocabulary, reading-comprehension, and grammar, without attempting to 
measure the oral-aural abilities and other less measurable values of mod- 
ern language teaching; and that the examinations should be constructed 
not solely on the basis of the New York City junior high school syllabi 
in French and Spanish, but should take advantage of the word-counts²
which had been made since the production of the syllabi. Fortunately, 
there were no irreconcilable conflicts between the syllabi and the results 
of recent investigations on the common fundamentals of vocabulary, 
grammar, and idioms, so that it was possible to produce tests which were 
valid not only for the junior high schools of New York City but for ele- 
mentary modern language classes all over the country. 

The French examination (Forms A and B) was constructed in collabo- 
ration with Mr. Greenberg and Miss Blanche Allain; and the Spanish 
examination (Forms A and B) in collaboration with Professor Frank C. 
Calcott and Mr. R. H. Williams, of the Spanish Department of Columbia 
University. 

The tests. — The tests in French and Spanish are exactly parallel in 
form, each consisting of three Parts: 

Part I is a vocabulary test of 100 foreign language words, each of
which is followed by five English words. The student shows his
knowledge of the word by selecting the one English word in the five which
corresponds in meaning to the French or Spanish word. The vocabulary
words were selected mainly from among the most frequently used 2000.

Part II is a reading-comprehension test, consisting of 60 carefully
graded statements in French or Spanish, respectively, each of which has
five alternative endings. Only one of the endings is coherent or true,
and the student shows his understanding of each statement by selecting
this one ending.

Part III is a grammar test consisting of 60 or 65 short English sentences
or phrases, each of which is followed by a French or Spanish translation
which is incomplete. The student shows his knowledge of the
grammatical and other points involved by completing the translations.

¹ See Méras, Roth, and Wood: “A Placement Test in French,” Contributions to Education, Vol. I,
J. C. Bell, Editor, World Book Co., Yonkers, N. Y., 1925.

² See e.g. Henmon’s The French Word Book, University of Wisconsin Bulletin No. 3, September, 1924;
Wood’s A Comparative Study of the Vocabularies of Sixteen French Textbooks, The Modern Language Journal,
February, 1927; and Jamieson’s A Standardized Vocabulary for Elementary Spanish, The Modern Language
Journal, March, 1924; and the syllabi and vocabulary lists of the junior and senior high schools of New
York City.

The selection of these forms of questions is based upon four years of 
continuous experimenting with many kinds and forms of questions. 
Their traits will be discussed at length later in this report. 

Each test thus consists of 220 or 225 elements, which constitute a 
sampling of modern language materials much broader and more varied 
than is possible in the old-type subjective examinations. The students 
are allowed thirty minutes for each Part, or ninety minutes for the whole 
test. The broad sampling is made possible by the fact that little time 
is lost in irrelevant activities, such as writing out translations of whole 
sentences and paragraphs. Writing out the translation of a sentence in 
longhand is not the only way in which a student may unambiguously 
indicate an exact understanding of the sentence, or ability to translate 
it. When we consider the complete objectivity of the scoring of the 
responses to this large number of carefully graded questions, it is not 
surprising that we achieve with examinations of this type results which 
are two or three times as reliable as those obtained with old-type tests. 
It is the objectivity of these tests which makes possible a comparison of 
their measurements. It is hoped that the importance of comparability 
in educational measurements will be made clear in succeeding sections of 
this report. Form A of each test is reproduced below on pages 47-85. 
The method of selecting the items for these tests and of equating the 
two forms of each has been described in detail in the article already cited 
on page 3. The method is a standardized procedure which has been 
used in principle in the construction of nearly all standard and valid 
examinations in other subject matters. The principles underlying the 
method have been set forth in a large number of publications. The most 
convenient source for the modern language teacher is probably the
Manuals of Directions for the Columbia Research Bureau Tests in Modern
Languages recently published by the World Book Company, Yonkers,
N. Y.

Administering and scoring the tests. — The modern language work in 
the junior high schools of New York City normally runs through two
years of time, or four semesters, and is designed to cover the first year
of modern language work in the senior high schools. The normal semester 
classes are designated in the usual way from lowest to highest as 8A, 
8B, 9A, and 9B. Students who successfully complete all four semesters 
are normally eligible to enter second-year classes in the senior high school. 
Parallel with the four-semester course we have the three-semester or “rapid
advancement” course, which covers the same material as the four-semester
or normal course, but does it in three semesters. Students are put in the 
rapid advancement group on the basis of general scholarship, and not 
on the basis of particular aptitude for modern language work. The three 
rapid advancement classes are designated from lowest to highest as RB, 
RC, and RD. 

Until the present survey was undertaken it had been customary to 
give a uniform examination each year to the 9B and RD classes, but 
not to any other modern language classes in the junior high schools. 
The main purpose of this uniform examination at the end of the junior 
high school modern language course was to eliminate the unfit, and allow 
only well-prepared students who gave promise of succeeding in senior 
high school modern language work to enter second-year classes in the 
senior high school. Thus the differentiating function of the junior high 
school was invoked on a uniform city-wide standard mainly at the end 
of the junior high school modern language course. It was in order to 
facilitate adjustments and constructive educational guidance during the 
junior high school course, as well as at its end, that these tests were de- 
signed to cover the whole range of achievement in all four semesters, 
thus affording accurate and comparable measures and valid city-wide 
standards for all four normal semester classes and all three rapid advance- 
ment classes. This use of a single scale of achievement for all classes 
represents a forward step in the administration of modern language 
instruction which is fundamental and of far-reaching importance. 

The old custom of giving separate subjective examinations to each 
semester- or year-class is slowly but surely disappearing, and deservedly 
so, because it is indefensible. It is part and parcel of the old time- 
serving conception of achievement, which was based upon ignorance 
or disregard of the fundamental fact of individual differences. The in- 
congruity of attempting to apologize for an examination system which 
often passes a student in third-year French and on the same day fails 
him in second-year French, while denying admission to third-year French 
examinations until the full sentence of three years has been served, is 
too manifest even to the most casual observer. 

In June, 1925, the new-type Comprehensive Examination in French for 
the Junior High Schools of New York City was administered to nearly 
19,000 students taking French: 8A, 8B, 9A, 9B, RB, RC, and RD; and
the corresponding Spanish examination to nearly 6,500 students taking
corresponding semesters of Spanish in the junior high schools of New 
York City. The tests were administered and scored by the teachers. 
Some irregularities occurred, as is always the case when such large num- 
bers of teachers and pupils are concerned; but they were very few in 
number and of small consequence. In so far as they had any influence 
they tended to make the tests seem less reliable than they really are. 
Several conferences were held by Mr. Greenberg with the entire staff of 
teachers, at which the whole procedure for administering and scoring 
the tests was explained in detail. In addition, mimeographed sheets with 
complete directions for administering the tests, and detailed scoring pro- 
cedure, were distributed. 

In order to make the results of this first trial of an objective and compa- 
rable scale as reliable as possible, it was planned to check all the scoring 
in Mr. Greenberg’s office. By this procedure it was thought that other 
irregularities might be detected and corrected. The first paragraph on 
the mimeographed sheets distributed to the teachers at the conference 
read as follows: 

“In scoring tests of this type, the mechanical method used is of great 
importance. Since the scoring will be checked in the office of the Director 
of Foreign Languages, it is imperative that each teacher should use the 
same system of marking. A scoring key will be furnished to each depart- 
ment head, from which a sufficient number of copies will be made for 
each teacher in the department. No deviations from the key and no 
additions to it should be made by any teacher without consulting the 
department head. All changes in, or additions to, the key should be 
reported in detail to the office of the Director.” 

The task of checking the scoring proved to be too great, so that only
a sampling of the papers from each school was checked. Nearly all of
the mistakes found were mistakes of arithmetic. As was fully expected, 
several additions had to be made to the key for Part III. The directions 
to the teachers included the following: 

“The score on each Part is the number of correct answers. No answer 
in any Part is to receive any credit unless it is absolutely correct, com- 
plete, and unambiguously indicated or written; but students should not 
be penalized for poor penmanship, and in Part III legitimate equivalents 
of the key answers should receive full credit. In Part III absolute cor- 
rectness must be rigorously enforced: no answer is to receive any credit 
if it is in any respect — spelling, punctuation, capitalization, etc. — deficient
or incomplete. Each correct answer in each Part should be marked
immediately at its right with a short, straight, neat, horizontal line, in
red or blue crayon: the score on each Part is the number of these red or
blue lines (not checks!). The scores on the Parts should be entered,
after the counting has been checked, on the cover page in the spaces
provided; the total score is the sum of the three Part-scores. It is im- 
portant that the red or blue marks be written neatly and clearly, in a 
straight vertical column, so that they may be easily and accurately 
counted. Check marks should be avoided; they are clumsy and confusing 
and render the checking of the scoring and of the counting very difficult. 
A little care and regularity and neatness in marking the correct answers 
will avoid many serious errors.” 
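
The scoring rule quoted above (one unit of credit for each absolutely correct answer, no partial credit, and a total equal to the sum of the three Part-scores) can be illustrated with a minimal sketch. The function names and the sample paper are hypothetical and serve only as illustration; the Part sizes are those given earlier in the text, with 65 taken as an upper bound for Part III.

```python
# Items per Part as described in the text; Part III has 60 or 65 items
# depending on the test, so 65 is used here as an upper bound (assumption).
PART_SIZES = {"I": 100, "II": 60, "III": 65}


def part_score(marks):
    """Return the number of answers marked correct in one Part.

    `marks` is a list of booleans, True only for an answer that is
    absolutely correct and complete; no partial credit is given.
    """
    return sum(1 for mark in marks if mark)


def total_score(marks_by_part):
    """Return the three Part-scores and their sum, the total score."""
    scores = {}
    for part, marks in marks_by_part.items():
        if len(marks) > PART_SIZES[part]:
            raise ValueError(f"Part {part} has more marks than items")
        scores[part] = part_score(marks)
    return scores, sum(scores.values())


# Hypothetical paper: 40, 25 and 10 correct answers in Parts I, II and III.
example_paper = {
    "I": [True] * 40 + [False] * 60,
    "II": [True] * 25 + [False] * 35,
    "III": [True] * 10 + [False] * 50,
}
part_scores, total = total_score(example_paper)
print(part_scores, total)
```

Because every answer is simply right or wrong, two scorers working from the same key should arrive at the same total, which is the objectivity the directions are meant to secure.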

Application of the test results. — As soon as each teacher had completed 
the scoring of her papers, and had entered the scores in her class roll, 
the papers were sent to the Office of the Director. Distributions of the 
scores of each class on a city-wide basis were made as rapidly as the 
papers arrived. On the basis of these distributions the passing and fail- 
ing grades for each class were established, and tables for converting 
scores into the usual percentage grades were sent to each department 
head. If an 8A student made a score above the passing line for 9A or 
9B, he was given a passing grade in 9A or 9B; if a student in 9A or 9B 
secured a score below the passing line of 8B or 8A, he was failed not only 
in 9B and 9A, but also in 8B and 8A. The adoption by Mr. Greenberg 
of this method of assigning grades, which will be explained in detail in 
succeeding pages, is a notable event, because so far as the writer knows, 
it is the first instance in the history of modern language teaching in which 
the principle of classification on the basis of actual individual achievement 
was, at a single stroke, substituted for the “time-serving” basis. The
reader should notice that however radical and drastic the scheme may 
appear, it is in reality nothing but a sane and logical extension of the 
sound principle already recognized in the organization of rapid advance- 
ment classes, and that it tends to a more effective and fuller realization 
of individual capacities of students than is otherwise possible. 
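
The grading principle described above, by which a student is credited with whatever class his score actually reaches regardless of the class he sat in, may be sketched as follows. The passing lines shown are invented for illustration only (the actual lines were fixed from the city-wide distributions and are not given here), and treating a score equal to the line as passing is likewise an assumption.

```python
# Classes of the normal course, lowest to highest.
NORMAL_SEQUENCE = ["8A", "8B", "9A", "9B"]

# Hypothetical passing lines (raw total scores); the actual values are not
# stated in this section and are assumed here to rise from class to class.
PASSING_LINES = {"8A": 30, "8B": 48, "9A": 72, "9B": 110}


def highest_class_passed(score, passing_lines=PASSING_LINES):
    """Return the highest class whose passing line the score reaches.

    Returns None when the score falls below the passing line of the
    lowest class, i.e. the student fails every class on the single scale.
    """
    passed = None
    for cls in NORMAL_SEQUENCE:
        if score >= passing_lines[cls]:
            passed = cls
    return passed


# An 8A student scoring 115 is credited with 9B; a 9B student scoring 25
# is failed in every class, including 8A.
print(highest_class_passed(115), highest_class_passed(25))
```

The enrolled class does not enter the function at all, which is the point of the scheme: the same score earns the same standing whatever time has been served.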

Administrative exigencies in the various schools, and the difficulties 
attending promotions and demotions by single subject-matters rather 
than by school-grade, made it impossible to take anything like full ad- 
vantage of the city-wide reclassification which these tests made possible. 
Whatever significance this survey may have, however, does not depend 
upon whether a reclassification was effected in 1925 or in any other one 
year: its significance lies rather in the fact that it has uncovered a crucial 
need for sound classification the extent of which was hardly suspected, 
and at the same time has demonstrated a sure and simple way for se- 
curing in exact and comparable units the measurements which are a 
sine qua non for the proper classification and educational guidance of 
students with respect to modern language work.¹


¹ The significance of these disclosures is augmented by the confirmatory evidence uncovered by the
use of new-type tests in the Regents Examinations in French, Spanish, and German of June, 1925. See
particularly pp. 146 to 198 of this volume.

In general terms, this survey shows how over 25,000 young hopefuls
in the largest city in the country are starting their modern language 
careers; it emphasizes the tremendous importance of individual differ- 
ences, and the stark fallacy of taking calendar-time spent “studying” a 
modern language or rather attending modern language classes, as a 
measure of achievement in that language. 


Note. — In reviewing the data of this survey we cannot escape the suspicion that
only a fraction of these 25,000 youngsters really belong in foreign language classes. This
suspicion becomes almost a conviction when we compare the foreign language situation
in this country with that in European countries.

Foreign language instruction is one of the oldest and most important elements in the 
liberal curriculum in all the countries of western civilization. For many centuries it has 
been a successful and exceedingly useful part of the education of cultivated people. No 
part of the liberal curriculum in Europe enjoys (or deserves) a greater prestige; but it has 
no such place in the primary school curriculum. 

In this country we have tried to put foreign language study in the common curriculum 
and have unjustifiably assumed it would be as successful as it has been in what the Euro- 
peans call secondary education. We have inherited the prestige of foreign language study 
without inheriting all the conditions which created that prestige centuries ago, and which 
have maintained it intact to this day, in Europe. Three of the most important conditions 
which have made foreign language study successful in Europe deserve special mention 
because of their sharp contrast with conditions in this country. 

In Europe foreign language instruction (a) is restricted to highly selected students, 
who are really competent to learn a foreign language in addition to the native language, 
and who can reach a general culture level high enough to make mastery of a foreign 
language useful in adult life activities; (b) is given by teachers of a high general culture
level, whose mastery of the language itself and whose knowledge of the culture and
civilization of the people that use it are above question, and who have had expert
pedagogical training; and (c) is continued, when once begun, long enough to make possible a
genuine mastery of the language itself and a real acquaintance with the culture, civiliza- 
tion and history of the people that use it. 

The present report is mainly concerned with the problem of bringing about in America 
the first condition mentioned above, namely, that of restricting foreign language in- 
struction to students who are really competent to learn more than one language in a way 
that will be useful and satisfying to them. The ideal is to define foreign language
objectives in relation to the capacities, interests, and needs of individual students.
Tentative decisions with regard to individual students can be made on the basis of intelligence
test results, facility with the native language, and general school standing; but borderline 
cases can be safely disposed of only by actual trial, with close observation of progress by 
means of such objective and comparable measuring devices as the tests used in this 
survey. 

Not the least important result of better educational guidance and of earlier elimination 
from foreign language work of students not fitted for it by original capacities, interests, 
and needs would be an immediate possibility of raising the general and special culture 
levels of the foreign-language teaching groups. The indiscriminate foreign language 
requirement for all kinds of students has so inflated the class rolls that many schools have
had to employ substitute teachers, and the general shortage of teachers has prevented 
any notable raising of the licensing standards. Some one has remarked that if foreign 
language classes were restricted to the right kinds of students, there would be enough 
good teachers to go around. The really competent students, having the undivided 
attention of better teachers, would respond at once to the sense of real achievement, and 
larger proportions of them would willingly continue foreign language study long enough 
to reach a genuinely satisfying stage of mastery. 


II
THE TEST RESULTS: CLASS NORMS 


Median and quartile scores by classes. — Table 1 shows the numbers 
of students in each class in French and in Spanish, and the median and 
lower- and upper-quartile scores of each such class. 


TABLE 1


NUMBERS OF STUDENTS TESTED IN EACH CLASS IN FRENCH AND SPANISH IN THE JUNIOR
HIGH SCHOOLS OF NEW YORK CITY, AND THEIR QUARTILE AND MEDIAN SCORES, JUNE,
1925.


French
                          8A          8B          9A          9B          RB          RC          RD
Number                 3,736       3,041       2,604       2,417       2,766       2,415       1,898
Lower Quartile          29.2        47.0        70.5       110.2        40.0        75.5       116.7
Median                  39.5        61.0        87.3       130.0        51.2        94.0       140.2
Upper Quartile          50.0 [illegible]       105.5       148.0        64.3       111.5       160.5

Spanish
                          8A          8B          9A          9B          RB          RC          RD
Number                 1,486       1,332         658         688         850         683         721
Lower Quartile          32.8        44.7        60.5        86.4        35.0        68.6        93.8
Median           [illegible] [illegible] [illegible] [illegible] [illegible] [illegible] [illegible]
Upper Quartile          49.7        66.7        86.0       136.3        53.2        98.0       134.2


In order to facilitate comparisons the medians and interquartile ranges 
of the classes are set forth graphically in Chart 1. 
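
The spread of each class can also be read off Table 1 directly. The following minimal sketch uses only the French figures as given in the table (8B is left out because its upper quartile is not legible there) and computes each interquartile range, together with the gap between the RD and 9B medians. The variable and function names are illustrative and form no part of the original study.

```python
# (lower quartile, median, upper quartile) for the French classes of Table 1.
# 8B is omitted because its upper quartile is illegible in the source.
FRENCH = {
    "8A": (29.2, 39.5, 50.0),
    "9A": (70.5, 87.3, 105.5),
    "9B": (110.2, 130.0, 148.0),
    "RB": (40.0, 51.2, 64.3),
    "RC": (75.5, 94.0, 111.5),
    "RD": (116.7, 140.2, 160.5),
}


def interquartile_range(lower, upper):
    """Width of the middle half of the scores, rounded to one decimal."""
    return round(upper - lower, 1)


iqr = {cls: interquartile_range(q1, q3) for cls, (q1, _, q3) in FRENCH.items()}

# Gap between the medians of the two classes that finish the course:
# RD (three semesters) and 9B (four semesters).
rd_minus_9b = round(FRENCH["RD"][1] - FRENCH["9B"][1], 1)

for cls, width in iqr.items():
    print(cls, width)
print("RD median minus 9B median:", rd_minus_9b)
```

On these figures the interquartile range widens from the first semester to the last in both the normal and the rapid sequence.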

Individual differences displayed. — This chart shows strikingly the 
inexorable force of individual differences and the subordinate importance 
of the time-factor with regard to the achievement of students, and their 
proper classification for instructional purposes, in modern language
classes.

Chart 1. — Medians and interquartile ranges of scores of each class in French and
Spanish in the junior high schools of New York City, June, 1925. The numbers in 
the center represent raw scores on both tests. The projection lines above the numbers 
relate to French, and those below to Spanish classes. Solid lines represent normal, and 
dash-lines rapid advancement classes. 


1 The reader should be warned at once against making comparisons between the achievements of the 
French and the Spanish departments on the basis of this chart. The Spanish classes seem to be closer 
together and nearer to the lower end of the score-scale; since the two tests were constructed in the same 
manner and were both based mainly upon the most frequent 2000 words and upon the common funda- 
mentals of grammar in each language the reader may be inclined to accept this apparent superiority of 
achievement in French as real. Those who make this interpretation do so at their own risk, because 
we do not know the relation between the units of the French and of the Spanish tests. The units of the 
Spanish test may have the same relation to the French test units as inches have to centimeters, or the
reverse. Various considerations, too involved for brief exposition, lead the writer to infer that the units 
on the two tests are very nearly equal, with the chances in favor of the theory that the Spanish units are 
slightly larger than the French units. But we cannot know, with the data at hand, the exact relations 
between these two tests, and therefore the reader is invited to make comparisons between projection- 
lines on one side of the line of raw scores, but not on opposite sides. 




(1) The medians of the classes are separated by increasing intervals as 
we go up the scale, the difference (in terms of raw score-units) between 
the 9A and 9B medians being slightly more than twice as large in both 
languages as the difference between the 8A and 8B medians. This seems 
to be due almost entirely to the progressive elimination of students be- 
cause of low scholarship in general, and not because of particular de- 
ficiency in modern language work. 

(2) The rapid advancement classes consistently achieve more than the 
normal classes and in less time. When we consider that the rapid ad- 
vancement pupils were selected not on the basis of particular aptitude 
for or interest in modern language work, but on the basis of general scholar- 
ship, their consistent outdistancing of the normals (in only 75 per cent of 
the time taken by the normals) is a fact the significance of which for edu- 
cational philosophy and administration can hardly be overestimated. 

(3) The chart gives more than a complete justification of the provision 
for individual differences inherent in the rapid advancement classes, but 
it shows definitely the inadequacy of the method of selecting students 
for rapid classes, and the inadequacy of the “two-track” or “two-speed”
plan of the course. This inadequacy has been recognized by Director 
Greenberg and by the corps of modern language teachers in the junior 
high schools. Every one has known that some students are able to 
achieve more, and do achieve more, in one semester than many others 
can or do achieve in four or five semesters. The need for more flexible 
adjustments to individual capacities and achievements has been clearly 
recognized, but there has been no means for securing the exact and com- 
parable measurements of the achievements of all students in all classes 
at the end of each semester which are prerequisite to individual adjust- 
ments and readjustments. The teachers, in spite of the relief derived 
from the differentiation into normal and rapid classes, have repeatedly 
complained of the heterogeneity of their classes. Individual schools have 
done what they could to reorganize their classes so as to make possible 
at least a certain minimum efficiency of teaching; but a different examina- 
tion or standard was used for each class and no two schools used the same 
examinations, except those given at the end of the course, with the re- 
sult that standards in different schools and in different classes in the 
same school became a heterogeneous maze from which the only escape 
lay in a centrally administered system of standardized examinations, 
adapted to the whole range of modern language achievement in the junior 
high schools. Such a system of examinations will not solve the adjust- 
ment problem, but it will furnish that minimal information about indi- 
vidual achievement without which no administration, however wise, can 
build an educational ladder flexible enough to insure continuous adjust-

Student eliminations. — The number of French 9B and RD students
who took the test in June, 1925, was 4315; and of 8A and RB students, 
6502; that is, the number of students in the last junior high school semester 
class is only about 66 per cent of the number in the first. The
corresponding numbers for Spanish were 2336 and 1409, which give a
“survival” percentage of about 61 per cent. No analysis of eliminations will
be attempted; but it seems worth while to call attention to two facts: 
(a) the elimination rate appears to be somewhat greater in the normal 
than in the rapid classes; and (b) the students eliminated appear to be a 
very heterogeneous group with respect to modern language achievement; 
that is, they are eliminated on general grounds, perhaps partly by chance, 
rather than on the basis of modern language deficiency. The “chance”
nature of the eliminations is indicated by the increasing range of scores 
in semester classes from 8A and RB to 9B and RD classes in Chart 1; and 
the more extensive eliminations in normal classes are indicated in Table 2, 
below. The 9B French class is only 65 per cent as large as the 8A French 
class; while the RD French class is 70 per cent as large as the RB French 
class; and the corresponding figures for Spanish are 48 per cent and 85 
per cent. As already indicated above, these figures should not be taken 
without caution; they indicate only the gross apparent elimination rate. 
The most important fact in this connection is that some good modern 
language students are eliminated, and many hopeless failures are re- 
tained. 
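
The percentages quoted in this paragraph can be checked by direct division of the enrolment counts. The following is a minimal sketch in Python, offered only as an arithmetic check and not as part of the original study; the function and variable names are illustrative. The Spanish RB enrolment is taken as 850, the figure that agrees with the stated sum 8A + RB = 2336. The printed per cents are approximate, so direct division may differ from them by a point or two: the Spanish gross survival, for instance, works out to a little over 60 per cent against the 61 printed.

```python
# Enrolment counts for June, 1925, transcribed from Table 2.
# Spanish RB = 850 is assumed because it agrees with 8A + RB = 2336.
ENROLMENT = {
    "French": {"8A": 3736, "8B": 3041, "9A": 2604, "9B": 2417,
               "RB": 2766, "RC": 2415, "RD": 1898},
    "Spanish": {"8A": 1486, "8B": 1332, "9A": 658, "9B": 688,
                "RB": 850, "RC": 683, "RD": 721},
}


def percent_of(part, whole):
    """Return `part` expressed as a per cent of `whole`."""
    return 100.0 * part / whole


def gross_survival(counts):
    """Last-semester enrolment (9B + RD) as a per cent of first-semester (8A + RB)."""
    first = counts["8A"] + counts["RB"]
    last = counts["9B"] + counts["RD"]
    return first, last, percent_of(last, first)


for language, counts in ENROLMENT.items():
    first, last, pct = gross_survival(counts)
    print(f"{language}: {last} of {first} = {pct:.1f} per cent")
    print(f"  9B as per cent of 8A: {percent_of(counts['9B'], counts['8A']):.1f}")
    print(f"  RD as per cent of RB: {percent_of(counts['RD'], counts['RB']):.1f}")
```

These are gross apparent rates only, as the text cautions: they compare different cohorts enrolled in the same month and do not follow individual students.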

TABLE 2

GROSS APPARENT ELIMINATIONS FROM MODERN LANGUAGE CLASSES IN JUNIOR HIGH
SCHOOL AS INDICATED BY EXPRESSING THE NUMBERS OF STUDENTS IN 2ND, 3RD, AND
4TH SEMESTER CLASSES AS PER CENTS OF THE NUMBERS IN BEGINNING SEMESTER CLASSES.

French
                                  8A     8B     9A     9B  ||    RB     RC     RD
Number of students, June, 1925  3736   3041   2604   2417  ||  2766   2415   1898
Approximate per cent            100%    81%    70%    65%  ||  100%    89%    70%

Spanish
                                  8A     8B     9A     9B  ||    RB     RC     RD
Number                          1486   1332    658    688  ||   850    683    721
Per cent                        100%    90%    45%    48%  ||  100%    80%    85%

French 8A + RB = 6502; 9B + RD = 4315; total French apparent survival = 66%
Spanish 8A + RB = 2336; 9B + RD = 1409; total Spanish apparent survival = 61%

TABLE 3
French
DISTRIBUTIONS OF SCORES OF NORMAL AND RAPID ADVANCEMENT CLASSES IN FRENCH
IN THE JUNIOR HIGH SCHOOLS OF NEW YORK CITY. NOTE THE OVERLAPPING: SOME
STUDENTS IN 9B ARE BELOW THE AVERAGE OF 8A, AND SOME 8A STUDENTS ARE ABOVE
THE AVERAGE OF 9B. EACH CLASS REPRESENTS THE WHOLE GAMUT OF ACHIEVEMENT.
THE RD CLASS, IN SPITE OF ITS HIGH AVERAGE, INCLUDES SOME STUDENTS WHO ARE
BELOW 8B AVERAGE, AND MANY WHO ARE BELOW 9A AVERAGE IN ACHIEVEMENT.


Scores
0-2
3-7
8-12
13-17
18-22
23-27
28-32
33-37
38-42
43-47
48-52
53-57
58-62
63-67
68-72
73-77
78-82
83-87
88-92
93-97
98-102
103-107
108-112
113-117
118-122
123-127
128-132
133-137
138-142
143-147
148-152
153-157
158-162
163-167
168-172
173-177
178-182
183-187
188-192
193-197
198-202
203-207
208-212
N
Q1
Mdn.
Q3

8A 8B 

12 
88 3 
139 10 
267 28 
327 58 
395 93 
485 156 
525 188 
401 279 
353 | 276 
227 | 272 
Gy ane xal 
100 | 249 
68 | 235 
56 187 
38 | 143 
Sou ELLs 
18 | 100 
14 | 100 
7 69 
6 50 
5 49 
3 35 
2 23 
21 
1 9 
1 10 
4 
1 2 
1 2 
1 1 
2 

a! 
Mt 

i 

1 
2 
3736 | 3041 
29.28 | 47.02 
39.48 | 60.91 


50.28 | 77.35 | 108.00 


9A 9B RB RC RD 
4 
33 

1 47 1 

6 73 3 

4 2 122 7 

10 3 150 6 

27 2 194 22 
33 3 278 33 1 
64 Zi 276 36 1 
66 3 315 68 8 
84 12 274 a2 3 
112 10 258 73 8 
161 13 189 99 13 
167 22 LOZ ule 18 
209 25 128 | 125 12 
200 42 TiO 19 
185 63 66 | 183 35 
205 60 50 | 184 42 
181 86 Pai |) ANCES 48 
170 91 21 189 46 
134 114 12 | 167 71 
131 112 5 | 153 64 
104 | 145 4 | 145 82 
95 | 148 4 | 120 87 
75 | 174 ] 82 | 108 
a) |p ALAS 2 65 | 107 
47 172 44 | 120 
24 157 39 130 
23 161 25 | 123 
11 162 1 15 | 105 
6 | 124 5 | 1388 
oy) 90 5 | 119 
1 88 1 A le 
2 54 2 96 
J 45 2 66 
J 18 49 
17 2 35 
if 11 23 
2 1 6 

1 1 
2 
1 1 
2604 | 2417 || 2766 | 2415 | 1898
72.49 | 110.01 || 39.19 | 65.60 | 118.2
87.30 | 129.99 || 51.25 | 94.04 | 140.15
148.25 || 64.32 | 111.56 | 159.45

TOTAL

18,877


RECLASSIFICATION

Eliminated from modern language classes or repeat 8A
8A to 8B
8B to 9A
9A to 9B
9B to 2nd-year senior high school

TABLE 4
Spanish
DISTRIBUTIONS OF SCORES OF NORMAL AND RAPID ADVANCEMENT CLASSES IN
SPANISH IN THE JUNIOR HIGH SCHOOLS OF NEW YORK CITY. NOTE THE OVERLAPPING;
SOME STUDENTS IN 9B ARE BELOW THE AVERAGE OF 8A, AND SOME 8A STUDENTS ARE
ABOVE THE AVERAGE OF 9B. EACH CLASS REPRESENTS THE WHOLE GAMUT OF
ACHIEVEMENT. THE RD CLASS, IN SPITE OF ITS HIGH AVERAGE, INCLUDES SOME STUDENTS WHO
ARE BELOW 8B AVERAGE, AND MANY WHO ARE BELOW 9A AVERAGE IN ACHIEVEMENT.


Scores 8A 8B 9A 9B RB RC RD TOTAL RECLASSIFICATION 
0-2 1 1 | Eliminated from 
3-7 8 8 | modern language 
8-12 7 1 8 16 | classes or repeat


39 73 8A 


94 Gai 382 8A to 8B 


33-37 | 202 73 
1 459 
43-47 | 228 135 145 16 2 549 
48-52 | 185 186 5 110 14 6 545 
53-57 87 174 4 92 30 5 432 8B to 9A
58-62 43 130 73 4 55 34 16 355 


52 | 22 |} 203 9A to 9B 
88-92 | 10 | 16 | 45 | 48


93-97 2 

98-102 6 13 15 51 51 37 175
103-107 4 15 11 41 40 48 160
108-112 2 11 15 42 25 42 137
113-117 2 4 9 54 14 59 142
118-122 1 7 8 32 10 59 117 | 9B to 2nd-year
123-127 1 4 3 32 15 36 91 | in senior high
128-132 1 6 33 6 43 89 | school
133-137 1 3 3 24 4 46 81
138-142 1 2 1 30 1 28 63
143-147 1 30 4 30 65
148-152 1 1 1 24 2 21 50
153-157 1 17 23 41
158-162 1 13 1 10 25
163-167 2 1 11 14
168-172 5 1 5 11
173-177 1 2 3 6
178-182 1 1 4 2 8
183-187 1 1 3 1 2 8
188-192 1 1 2 2 6
193-197 1 1 1 2 5
198-202 1 1
203-207 1 2 3
208-212 1 2 3
213-217 1 1
223-227 1 1

N 1486 | 1332 | 658 | 688 || 850 | 683 | 721 || 6418


Q1 32.29 | 44.74 | 60.30 | 89.25 || 35.10 | 68.47 | 94.13
Mdn. 41.31 | 54.69 | 71.80 | 109.07 || 44.86 | 82.63 | 115.33
Q3 49.82 | 66.58 | 85.09 | 131.18 || 53.61 | 98.39 | 145.17

Distributions of scores by classes. — Tables 3 and 4 show in greater
detail some of the facts brought out above, and expose the extreme 
heterogeneity of the classes so strikingly that comment is unnecessary. 
Chart 1 was presented above to justify the rapid advancement classes 
at the outset: Tables 3 and 4 show the need for extension of the rapid 
advancement idea on the basis of objective, reliable, and comparable 
measures of individual achievement. The inadequacy of the classification
as evidenced in these tables is such that maximum efficiency of teaching
is impossible.

Students reclassified on uniform basis. — While we have Tables 3 and 4 
in mind we may consider in detail the method of transmuting scores on 
these tests into grades and the constructive principle underlying that 
method. The last two columns in Tables 3 and 4 give an approximate
picture of the reclassification scheme which was authorized by the Director 
of Foreign Languages and which was adopted by all the junior high schools 
in so far as administrative exigencies permitted. 

(1) Each student that received a score of less than 28 on the French 
test (Table 3), regardless of how many semesters he had “taken” French, 
was given a failing grade in French 8A: if the student receiving this fail- 
ing grade had been in French work more than one semester, it was recom- 
mended that he be eliminated from modern language work, unless there 
was evidence that he had not had a fair chance and that he desired to 
continue the study of French; if he had studied French only one semester, 
he might perhaps be eliminated, on collateral evidence of his capacities 
and interests, or be permitted to repeat French 8A. 

We use the phrase “failing grade” only because it is customary, and
not because we believe the phrase a legitimate one. The only legitimate 
purpose of examinations is to describe achievement as exactly and as 
meaningfully as possible, and the only worthy use that can be made of 
such descriptions is in constructively guiding students so that they may 
realize their highest potentialities in school and in life. The “failures”
of most students may really be the “failures” of their teachers and
educational advisers. But the word is unnecessary and misleading, except
where schools and examinations are still looked upon as punitive devices. 
We use the phrase only in reference to the students who after one or more 
semesters of work have secured scores below 28, and who therefore are 
the most likely candidates for transfer to other departments of school 
work. We use the word “eliminate” with the same reservations. 

(2) Students receiving scores between 28 and 53 were given “passing” 
grades in French 8A, regardless of how many semesters they had “taken” 
French; they were considered eligible to enter the 8B class, by promotion 
or demotion, as the case might be. All students demoted were to be 
interviewed by their educational advisers with a view to possible transfer
of efforts to fields other than the modern languages. All students
promoted more than one semester were to be considered as candidates for
rapid advancement classes. 

(3) Students receiving scores between 53 and 83 were given “passing”
grades in 8B and were eligible to continue with 9A. Students who had 
reached this status in less than the normal time, i.e. in less than two 
semesters, were to be considered as candidates for rapid advancement 
classes; those who had reached this status only after three or more se- 
mesters were to be interviewed by their advisers for possible elimination 
from modern language work. 

(4) Students receiving scores between 83 and 118 were given “passing” 
grades in French 9A and were eligible to continue with French 9B.

(5) Students achieving scores above 118 were given “passing” grades 
in French 9B and were eligible to continue with work equal in grade to
that of the second year in the senior high school. 
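
The five rules above amount to comparing each score with four limits. A minimal sketch in Python follows, given as an illustration rather than as the procedure actually used by the schools. It assumes whole-number scores and treats each limit (28, 53, 83, 118) as the lowest score of its band; the text says only "between 28 and 53" and "above 118", so the handling of a score falling exactly on a limit is an assumption, suggested by the five-point intervals of Table 3, which begin at these values. The names `FRENCH_BANDS` and `reclassify` are hypothetical.

```python
# Lower score limits for the French test, taken from rules (1) to (5).
# Treating each limit as the lowest score of its band is an assumption.
FRENCH_BANDS = [
    (118, "pass 9B: eligible for second-year senior high school work"),
    (83, "pass 9A: eligible for 9B"),
    (53, "pass 8B: eligible for 9A"),
    (28, "pass 8A: eligible for 8B"),
]
BELOW_ALL = "below 8A standard: repeat 8A or transfer from modern language work"


def reclassify(score, bands=FRENCH_BANDS):
    """Return the reclassification band for a test score.

    The number of semesters a student has "taken" plays no part here;
    the text uses it only afterwards, when advising the student.
    """
    for lower_limit, placement in bands:
        if score >= lower_limit:
            return placement
    return BELOW_ALL


for score in (20, 28, 60, 100, 150):
    print(score, "->", reclassify(score))
```

Passing a different list of limits as `bands` would adapt the same lookup to the Spanish test, whose limits the text leaves to be read from Table 4.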

The reclassification scheme for Spanish students was identical with 
that for French students, and need not be specifically described. Only 
the score-limits in the preceding five paragraphs need be changed. This 
can be done by the reader by reference to the right-hand column of Table 4.

Fallacy of “time-serving” basis for classification. — Thus some students
who had taken French only one or two semesters could have been given 
credit for French 9B by their principals if they had acted fully under the 
authorization given by the Director of Foreign Languages in the Junior 
High Schools. To those who have not escaped from the old “time-serving”
theory of achievement, and who still regard school grades as
punitive devices and as means for conferring vague and mystical “credits”
and “honors” rather than as exact descriptions of achievement which are
to be used for constructive educational guidance of students, this re- 
classification scheme may seem drastic beyond all reason. Some of the 
writer’s colleagues were frankly sceptical; not, however, as to the wisdom 
of demoting 9B students who were found below the average of the 8A 
or 8B or 9A classes; but “giving” 9B credit to students who had “taken”
only one semester of work and who had not “taken” 8B and 9A, seemed 
to them like giving much for little. But this scepticism is based on the 
mistaken notion that grades are “gifts” from the teacher to the student; 
whereas the teacher has no more right to “give” a grade in French than 
he has to give a grade in height or weight. To give a student of 9B 
achievement credit for 8A only, because he has studied French only one 
semester, is as justifiable as to force a precociously tall boy to sit at a 
school desk which is just high enough for the average student of his age. 
The students who achieve in one semester as much as average students 
do in four are precisely the students who are most likely to carry on with 
distinction the scholarly work for which the world is already indebted
to modern language scholars. Let us not punish these capable young
students because they are not stupid enough to require four semesters 
for what they have achieved in one. 

It will be noticed that we have suggested no separate grading scheme 
for the rapid advancement classes. Our failure to do this is by no means 
an oversight, but in entire consonance with the principle that all students 
should be measured on one scale, and that all measurements or grades 
should, in and of themselves, do no more (and no less) than describe in 
understandable terms the achievement of each student at the time of 
taking the examination. How the student should be advised to continue 
modern language work depends, among other things, on his present 
achievement and on the time he has found necessary to reach his present 
degree of proficiency. A student who has progressed more than the aver- 
age student in equal or less time, should be put in an RC or RD class, 
according to his present achievement, or simply be promoted ahead to 
the normal class for which his present achievement makes him eligible. 
We need have no fear about “skipping” grades; these bright students 
have not skipped the work of the elided semesters; they have learned it 
because of superior effective ability, and probably in part to save them- 
selves from boredom after they had learned the preceding semester’s 
work but were still held in the same class to witness the slow progress 
of their “normal” classmates. The semester should not be regarded as
the metronome for every performer in the school ensemble. 

It appears that teachers generally exaggerate the limiting force of 
syllabi, and overestimate the importance of the artificial segmentation of 
courses of study into semesters or years or classes. Fortunately, a modern 
language is not actually “chopped up” as the syllabi are, and students 
do not absorb a modern language by semester-“chops.” Students are 
always learning something that, according to the syllabus, they have no 
business to learn until a later semester, and are continually reinforcing 
something they have had in a preceding semester. Thus the good 8B 
students, while learning 8B material, learn more from a voluntary fore- 
taste of 9A material than their slow classmates will learn by devoting 
a full semester to 9A work. 

The important thing is that classes should be kept homogeneous,
whether they are called “normal” or “rapid,” and that each student be 
kept busy, constantly stimulated and mentally challenged by his class 
work and progressing at his maximum rate. Indeed, so long as students 
are reclassified every semester on the basis of one uniform, objective, 
reliable examination which gives valid measures of achievement for all 
classes and which is so constructed as to insure comparability of measures 
over a series of years, it does not matter much whether they are put in 
normal or rapid classes. 


Need for comparable measurements in junior and senior high schools. 
It cannot be too strongly urged that modern language measurements in 
the junior high should be comparable with those in the senior high 
school. Junior high school students theoretically cover only the first year 
of senior high school work in modern languages, and are therefore eligible 
only for second year work in senior high school when they leave the junior 
high school. This may or may not be a correct guess. It seems quite 
likely that second year work in the senior high school is adequate for the 
average student who has “had” four semesters of junior high school work; 
but it seems certain that many students above the 9B average are really 
capable of taking third-year senior high school work, and that it is a 
dangerous waste to keep such students back. The only way in which 
the articulation between junior and senior high school can be improved 
is to extend the use of comparable measuring devices to the fourth year 
in senior high school, or at least to the third year, and to make the same 
provisions (or better) for individual differences in the senior as in the junior 
high school. 

There seems to be no good reason why the same standard examinations, 
insuring comparability from year to year, and covering the whole range 
of modern language achievement in the first four senior high school years, 
should not be used in both junior and senior high schools. These tests 
have not been given to many senior high school students, but the indica- 
tions are that students at the end of the second year in senior high school 
modern language work are far from being distinctly above the 9B and 
RD junior high school students.¹ The argument may be advanced that
the students have forgotten their first-year work by the end of their second 
year in senior high school; but this argument is an unwarranted reflection 
on the senior high schools, and merits no consideration. The fact is that 
the students never learned their first-year materials, but got into second- 
year classes because of the use of unstandardized, subjective, and unre- 
liable examinations which did not afford comparable measures. Too much 
emphasis cannot be put upon the necessity for comparable measures 
throughout the first three or four high school years of modern language 
work; it seems clear that most of the maladjustments in elementary 
language classes are due to the unfortunate and indefensible custom of 
giving separate and distinct examinations to the four elementary high 
school classes. 

¹ Cf. evidence given in the report of the Regents experiment, p. 208 this volume.

Equivalence of tests determined experimentally. — In further reference to
the reclassification scheme proposed above (Tables 3 and 4), it should be
noted that the same uniform standards for each semester class can be
used as long as may be desired by the junior high schools. Form B of
the junior high school examinations in French and Spanish is equivalent
to Form A, and additional equivalent Forms can be constructed as
often as desired. The equivalence of Forms A and B as to difficulty, 
variability, and representativeness of the various aspects of the languages 
was based upon careful word-counts and careful syntheses of grammatical 
elements and idioms, and upon experimental evidence secured by ad- 
ministering the materials in both Forms to large numbers of modern 
language students.¹ The equivalence of the usual old-type modern
language examinations which are constructed from year to year and used
in high schools and for college entrance purposes, depends upon the
judgments of modern language scholars. That there are large variations 
from year to year and place to place in the old-type subjective examina- 
tions every one knows; the wonder is that the variations are not larger. 
It is certainly not surprising to find that sometimes third-year examina- 
tions derived in the old way turn out to be easier than second-year examina- 
tions similarly derived. Such errors of judgment are not derogatory to 
the makers of such tests; but denial of such errors, and an insistence on 
the use of types of examinations which necessarily involve such errors, 
when the feasibility and convenience and accuracy of other types have 
been demonstrated, are at least signs of extraordinary conservatism and 
intellectual cautiousness. 

Distributions of scores by parts. — For the convenience of those who
are now using, or may in the future use, these tests, and who may desire 
to make comparisons between groups of students on the basis of vocabu- 
lary scores alone, or reading or grammar scores alone, we present here 
without comment distributions, with medians and quartiles, of scores of 
about 2000 students, selected at random, on each Part of each test. 
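
The medians and quartiles printed with these distributions can be obtained from grouped frequencies by interpolating within the interval in which the required case falls. The following minimal sketch in Python illustrates the usual computation; it is not presented as the author's stated procedure. It assumes that each interval is five points wide and begins at its printed lower limit, and that the first legible entry of the French 8A column of Table 3 (12) belongs to the interval 3-7. Under those assumptions the sketch reproduces the printed Q1 (29.28) and median (39.48) for French 8A. Only the lower part of the column is used, with N taken from the table.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction, total):
    """Interpolate the score below which `fraction` of `total` cases fall.

    `lower_limits` and `frequencies` describe consecutive class intervals,
    each `width` points wide, starting from the lowest interval.
    """
    target = fraction * total
    cumulative = 0
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= target:
            # Assume cases are spread evenly through the interval.
            return lower + width * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("target lies beyond the frequencies supplied")


# Lower part of the French 8A column of Table 3 (alignment assumed).
LOWER_LIMITS = [3, 8, 13, 18, 23, 28, 33, 38]
FREQ_8A = [12, 88, 139, 267, 327, 395, 485, 525]
N_8A = 3736

q1 = grouped_quantile(LOWER_LIMITS, FREQ_8A, 5, 0.25, N_8A)
median = grouped_quantile(LOWER_LIMITS, FREQ_8A, 5, 0.50, N_8A)
print(round(q1, 2), round(median, 2))
```

The same function applies to any column of Tables 3 to 5 for which the frequencies are legible, provided the interval width is changed to suit (three points for Part I of Table 5, for instance).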

Overlapping of classes displayed graphically.— The four percentile 
curves below, Charts 2-5, are presented to show graphically the extent 
of the overlapping of classes, and for the convenience of other investiga- 
tors who are using these tests and who may wish to compare the scores 
of students in other schools with the scores of New York City junior high 
school students. These graphs seem very difficult to read at first sight, 
but they are really as simple as the ogive curves are graceful. 

Definition of a percentile curve. — The simplest and most effective
description of the percentile curve and its uses has been prepared by Otis.²
His book cannot be too highly recommended to those who wish to be- 
come acquainted with the fundamentals of statistical methods in educa- 
tion. Otis elsewhere defines a percentile curve as follows: 

“A percentile curve is a smooth line having a horizontal length
representing 100 per cent of the scores of any group of individuals and so drawn


¹ Cf. below, pp. 207-208.

² Otis, A. S.: Statistical Method in Educational Measurements, World Book Co., Yonkers, N. Y., 1925,
pp. 53-67. Cf. pp. 154 ff. this volume, for confirmatory evidence on overlapping of classes in senior high
schools.

TABLE 5
DISTRIBUTIONS OF SCORES ON EACH PART OF THE FRENCH AND SPANISH TESTS
PART I FREQUENCIES PART II FREQUENCIES || PART III FREQUENCIES
SCORES SCORES
FRENCH SPANISH FRENCH SPANISH FRENCH SPANISH
0 1 0 9 Biff 120 
2-4 1-2 6 20 134 132 
5-7 9 1 3-4 30 54 166 286 
8-10 35 10 5-6 59 53 158 213 
11-13 36 28 7-8 61 86 206 172 
14-16 46 37 9-10 87 133 149 156 
17-19 107 69 11-12 119 145 149 119 
20-22 116 93 13-14 122 (1631! 163 98 
23-25 140 106 15-16 137 160 98 84 
26-28 ifs 142 17-18 162 155 68 60 
29-31 il 142 19-20 112 146 90 51 
32-34 152 158 21—22 95 103 93 42 
35-37 149 151 23-24 99 115 64 20 
38-40 107 159 25-26 76 102 66 26 
41-43 99 149 27-28 82 87 56 2a 
44-46 91 138 29-30 66 67 49 Zt 
47-49 79 134 31-32 60 WZ 58 32 
50-52 87 103 33-384 81 68 46 33 
538-55 68 82 35-36 %3 62 45 22 
56-58 64 68 37-38 57 53 34 14 
59-61 62 63 39-40 68 63 22 22 
62-64 62 30) 41-42 46 38 29 12 
65-67 on, 36 43-44 45 31 18 10 
68-70 Ou 37 45-46 43 16 13 5 
71-73 49 29 47-48 56 11 18 2 
74-76 39 14 49-50 54 iM! 7 2 
77-79 Dit 4 51-52 61 5 6 3 
80-82 9 4 538-54 38 3 4 1 
83-85 1 z 50-06 35 3 3 
86-88 5 57-58 13 2 
89-91 2 59-60 10 3 
92-94 1 4 61-62 
95-97 1 63-64 1 
98-100 1 
N 2053 2007 2053 2007 2053 2007 
Q1 26.46 29.32 15.44 13.03 Tews 3.35
Mdn. 35.96 39.24 23.64 19.41 13.34 7.61
Q3 02.27 49.71 37.41 29.24 23.90 15.21


that any point on the curve has a height representing the amount of a 
given score and a horizontal position on the graph representing the per 
cent of the scores of the group that is exceeded by the given score. A 
percentile curve shows at a glance not only the median score of a class 
but also the range and variability of the scores. It shows at a glance 
just what per cent of the scores of the class is exceeded by the score of 
any given individual and just what per cent of the class attains or ex- 
ceeds any given score. Two or more curves on the same graph show 
very vividly the amount of overlapping of the scores of different classes.” 
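
Otis's definition can be turned directly into a small computation: for each score, find the per cent of the group's scores that it exceeds, and plot the score against that per cent. The sketch below, in Python, is a minimal illustration under that reading; the class scores are hypothetical, and the function names are not drawn from Otis or from this study. A plotted curve would be smoothed, whereas this sketch only lists the points.

```python
def percent_exceeded(scores, value):
    """Per cent of the scores in `scores` that are strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100.0 * below / len(scores)


def percentile_curve(scores):
    """Return (per cent exceeded, score) pairs, one per distinct score, in rising order."""
    return [(percent_exceeded(scores, v), v) for v in sorted(set(scores))]


# Hypothetical scores for a class of ten students.
hypothetical_class = [12, 20, 20, 31, 35, 35, 35, 48, 60, 77]

for pct, score in percentile_curve(hypothetical_class):
    print(f"{pct:5.1f} per cent exceeded by a score of {score}")
```

Listing the points for two classes side by side shows the overlapping that the charts display graphically: any score that appears in both lists with a middling per cent in each belongs equally well to either class.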

From these charts the reader may learn, for example, what proportion
of a given class, e.g., French 8B, is (1) below the average of the next
lower, or (2) above the average of the next higher class in achievement;
or what proportion of the 8B class is (3) nearer to the 8A average than to
its own average, and what proportion is (4) nearer to a higher class
average than to its own. To answer the first question, apply any
straight-edge, which is kept constantly parallel to the horizontal lines in the
chart, to the point in Chart 2 at which the 8A curve crosses the vertical 
line headed 50, that is, the 50th-percentile line; note where this straight- 
edge cuts the 8B curve, and read vertically above this point the per cent 
of 8B pupils who are below the average of 8A students — in this case 
about 13 per cent. 

To answer the second question, slide the straight-edge up to the point 
where the 9A curve crosses the 50th-percentile line, being careful to keep 
the edge parallel to the horizontal lines in the chart; note where the 
straight-edge cuts the 8B curve, and read vertically above this point the 
proportion of 8B students who are below the 9A average, in this case about 
83 per cent; subtract 83 from 100 to find the proportion of 8B students 
who are above the 9A average, i.e., 17 per cent. 
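
The two readings just described have a simple numerical equivalent when the individual scores are at hand: count the share of the middle class falling below the median of the lower class, and the share falling above the median of the higher class. The following Python sketch illustrates this with three small hypothetical classes; it does not use the New York City scores, which are given here only in grouped form, and the function names are illustrative.

```python
from statistics import median


def percent_below(scores, cutoff):
    """Per cent of `scores` strictly below `cutoff`."""
    return 100.0 * sum(1 for s in scores if s < cutoff) / len(scores)


def percent_above(scores, cutoff):
    """Per cent of `scores` strictly above `cutoff`."""
    return 100.0 * sum(1 for s in scores if s > cutoff) / len(scores)


def overlap_with_neighbours(lower, middle, higher):
    """Shares of `middle` below the lower class median and above the higher class median."""
    return (percent_below(middle, median(lower)),
            percent_above(middle, median(higher)))


# Hypothetical scores standing in for three successive semester classes.
lower_class = [20, 30, 40, 50, 60]
middle_class = [35, 45, 55, 60, 65, 70, 75, 80, 90, 95]
higher_class = [60, 70, 85, 100, 110]

below, above = overlap_with_neighbours(lower_class, middle_class, higher_class)
print(f"below lower median: {below} per cent; above higher median: {above} per cent")
```

In the chart readings for French 8B the corresponding figures are about 13 and 17 per cent, which together make the 30 per cent discussed in the next paragraph.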

From these two simple operations we have learned the highly important 
fact that about 30 per cent of the 8B students in the junior high schools 
of New York City at the end of the 1924-1925 session were either below 
the average of the 8A class or above the average of the 9A class in achieve- 
ment. This fact means that the 8B teachers did not have classes to teach, 
but heterogeneous aggregations of unhappy students, some of whom 
were below the lower quartile of the 8A class, and some of whom were 
above the average of the 9B class! To expect teachers or students to be 
at maximum efficiency in such “classes,” especially when the enrolment 
averages about 50 students, and often reaches 75 per recitation group, is 
over-optimistic, to say the least. 

To answer the third question suggested above, apply the straight-edge 
at the point midway between the points at which the 8A and 8B curves 
cut the 50-percentile line; keeping the edge parallel to the horizontal 
lines on the chart, note the point at which the edge cuts the 8B curve, 
and read vertically above that point the per cent of 8B students who are 
nearer to the 8A average in achievement than to their own class average, 
that is, about 30 per cent. (Holding the edge in this position and reading 
above the point at which it cuts the 8A curve to the right, the reader 
learns that about 24 per cent of the 8A students are nearer to the 8B, 
or some higher average, than to their own class average.) 

To answer the fourth question, slide the edge up to the point midway
between the points at which the 8B and 9A curves cross the 50-percentile
line, taking care to keep it parallel to the horizontal lines in the graph;
note where the edge cuts the 8B curve, and read vertically above that
point the per cent of 8B students who are not nearer to a higher average
than to their own, that is, about 71%; subtract 71 from 100 and you have
the per cent of 8B students who are nearer in achievement to some higher
class-average than they are to their own class-average: about 29%.
(Holding the edge in this position, and reading vertically above the point
where it cuts the 9A curve to the left, we find that about 30% of the
9A students are nearer to the 8B average than to their own class-average.)

CHART 2. — Percentile curves of scores of the 8A, 8B, 9A, and 9B

CHART 3. — Percentile curves of scores of the RB, RC, and RD French

From the first two questions we learned that about 30% of all 8B 
students are either below the average of the next lower class or above 
the average of the next higher class; from the answers to the last two 
questions, we learn that about 60% of the 8B students are nearer to some 
other class-average than they are to their own class-average in achieve- 
ment. Thus it appears that only about 40% of the 8B students are closer 
to their own class-average than to some other class-average. 
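
The third and fourth readings rest on the points midway between adjacent class medians: a student scoring below the lower midpoint is nearer to the lower class average, and one scoring above the upper midpoint is nearer to a higher one. A minimal Python sketch of that division follows, again with hypothetical scores and illustrative names rather than the survey data.

```python
from statistics import median


def nearest_class_shares(lower, middle, higher):
    """Per cents of `middle` nearer the lower median, a higher median, and its own median."""
    m_low, m_mid, m_high = median(lower), median(middle), median(higher)
    low_cut = (m_low + m_mid) / 2    # midway between lower and own median
    high_cut = (m_mid + m_high) / 2  # midway between own and higher median
    n = len(middle)
    nearer_lower = 100.0 * sum(1 for s in middle if s < low_cut) / n
    nearer_higher = 100.0 * sum(1 for s in middle if s > high_cut) / n
    nearer_own = 100.0 - nearer_lower - nearer_higher
    return nearer_lower, nearer_higher, nearer_own


# Hypothetical scores standing in for three successive semester classes.
lower_class = [20, 30, 40, 50, 60]
middle_class = [35, 45, 55, 60, 65, 70, 75, 80, 90, 95]
higher_class = [60, 70, 85, 100, 110]

print(nearest_class_shares(lower_class, middle_class, higher_class))
```

For French 8B the chart readings were about 30 per cent nearer the 8A average and about 29 per cent nearer a higher average, which together give the "about 60%" of the text and leave about 40 per cent nearer their own class-average.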

By reading similarly from the percentile graph for the rapid advance- 
ment classes in French, we learn that about 20% of the RC students are 
nearer to the RB class-average than to the RC class-average, and that 
about 23% are nearer to the RD class-average than to the RC class- 
average. At the same time we learn that 17% of the RB and about 
22% of the RD students are nearer to the RC class-average than to their 
own respective class-averages. About 4% of the lowest rapid class is 
above and about 7% of the highest rapid class is below the average of 
the middle rapid class; about 5% of the middle rapid class is below the 
average of the lowest and about 6% is above the average of the highest 
rapid class. 

From the graphs for all French classes, therefore, we learn that over 
2000 of the nearly 19,000 junior high school French students are mis- 
placed by a whole semester or more, and that approximately 10,000 of 
them are nearer to a higher or lower class-average than to the average 
of the class in which they were reciting during the spring of 1925. Teach- 
ers who have not struggled with such heterogeneity in large classes, or 
who have never experienced the delight of working with reasonably 
homogeneous classes, can scarcely apprehend the full extent of the evils, 
—in terms of the sacrifice of bright students on the altar of mediocrity, 
in terms of the cruel and more than useless browbeating of dull students, 
and in terms of wasted energies and frayed nerves of teachers, — which 
are inevitable concomitants of such misplacement and misguiding of stu- 
dents as this survey has uncovered in the junior high schools of New 
York City. Mr. Greenberg’s courage in uncovering this situation, and
his determination to give both teachers and students a fair chance
are to be commended. The situation is only aggravated by the prevalence
of different standards for each class among the various schools; but
before taking up the variable standards of individual schools, we may
summarize the figures for overlapping of Spanish classes.

Spanish classes less homogeneous than French classes. — Referring to the 
percentile graph for the four normal Spanish classes, we find markedly 
greater heterogeneity than in the French classes. Of the Spanish 8A 
students 30% are nearer to some higher class median than to their own 
class median; 15% are above the 8B average, and at least 5% are above 
the 9A average! Of the 8B students about 18% are below the 8A aver- 
age, and about an equal proportion are above the 9A average! In other 
words, about 36% of the 8B Spanish students are misplaced by a whole 
semester or more, the upward misplacements being almost equal in extent 
and number to the downward misplacements. About 30% of the 8B 
students are nearer to the 8A and more than 30% are nearer to the 9A 
or some higher median than to their own class-average; fewer than 40% 
of the Spanish 8B students are nearer to their own median than to a 
higher or lower median!

Of the 9A students about 16% are below the 8B, and about 5% below 
the 8A average; about 6% are above the 9B median; and about 50% are 
nearer to a higher or lower median than to their own median. 

Among the rapid Spanish classes, the overlapping is not so great nu- 
merically, but is probably more serious in its consequences and implica- 
tions; for these classes were formed specifically to allow for individual 
differences, to eliminate overlapping and provide conditions favorable to 
the optimum development of each student. We find that about 11% 
of the RB students are nearer to the RC than to their own median, while 
about 16% of the RC students are nearer to the RB than to their own 
class median. About 8% of the RC students surpass the RD median; 
and about 25% are nearer to it than to their own class median, while 
about 15% of the RD students are below the RC median, and about 
30% are nearer to it than to their own class median. 

In general terms, then, more than one in ten of the Spanish students 
in New York City junior high schools are misplaced by one semester or 
more, and considerably more than half of the students are nearer in 
achievement to the average of a class which is one semester higher or 
lower than to the average of the class in which they recited during the 
spring of 1925. 
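
The overlapping figures quoted above rest on a simple count: for each student, ask whether his score lies nearer to some other class median than to the median of his own class. A minimal sketch of that count follows. The scores and the neighboring medians are hypothetical values chosen for illustration, not survey data, and the function name is our own.

```python
from statistics import median


def share_nearer_to_other_median(own_scores, other_medians):
    """Fraction of a class whose scores lie nearer to some other
    class median than to the median of their own class."""
    own_median = median(own_scores)
    nearer = 0
    for score in own_scores:
        own_gap = abs(score - own_median)
        # Count the student if any other class median is strictly closer.
        if any(abs(score - m) < own_gap for m in other_medians):
            nearer += 1
    return nearer / len(own_scores)


# Hypothetical scores for one 8B class (illustrative only).
scores_8b = [35, 42, 50, 55, 58, 60, 61, 63, 66, 70, 78, 85, 95, 104]
# Hypothetical medians for the class below (8A) and the class above (9A).
share = share_nearer_to_other_median(scores_8b, other_medians=[40, 90])
print(share)
```

The same function can be applied class by class to any table of scores in which the class medians are known.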

The heterogeneity of the classes is strikingly pictured in a slightly 
different manner in Charts 6-9. These graphs will also be more easily 
understood by readers who are not familiar with percentile graphs. Charts 
6-9 are simple graphs of the distributions of scores in Tables 3 and 4. 

Standards in individual schools. — In Charts 10-13 the reader will find 
a partial explanation of the incredible overlapping of classes displayed 
above and will apprehend at first hand the inexorable need for comparable 
measures, such as are afforded by an objective test including a valid 
sampling of modern language elements carefully graded in difficulty to 
cover the whole range of achievement in the first two or three years, and 
which can be indefinitely duplicated by the construction of additional 
equivalent Forms. 

Chart 4. — Percentile curves of scores of the 8A, 8B, 9A, and 9B 

Chart 5. — Percentile curves of scores of the RB, RC, and RD 

Chart 6. — Graphs of the distributions of scores of 8A, 8B, 9A, and 9B students of 
French in the junior high schools of New York City, June, 1925. 

The charts are very easy to read. The four horizontal lines in Chart 10 
represent the averages of the classes on a city-wide basis, the score scale 
being at the left. Thus the reader will notice that the lowest line, marked 
at its left “8A,” is drawn just a little below the line corresponding to a 
score of 40; by referring to Table 1 the reader will find that the average 
of all 8A students was 39.5. The average score of all 8B students is given 
in Table 1 as 61, and the 8B line in Chart 10 is drawn slightly above the 
line corresponding to a score of 60. And so with the 9A and 9B lines. 

Chart 7. — Graphs of the distributions of scores of RB, RC, and RD students of 
French in the junior high schools of New York City, June, 1925. 

Chart 8. — Graphs of the distributions of scores of 8A, 8B, 9A, and 9B students of 
Spanish in the junior high schools of New York City, June, 1925. 

The vertical line near the left end of each of the class average lines 
represents the interquartile range of the class, that is, the range of the 
middle fifty per cent of the students in the class. Thus the vertical line 
near the left end of the 9B average line extends from the line corre- 
sponding to a score of 110 to a point above the 9B average line corresponding 
to a score of 148. These points are also based upon Table 1; 
if the reader will look at Table 1 he will find that the lower quartile of 
the 9B class is 110, and the upper quartile is 148. In other words, 25% 
of the 9B students secured scores below 110, 50% secured scores between 
110 and 148, and 25% secured scores above 148. The interquartile 
ranges of the classes as indicated on these charts will be a convenience 
to certain types of readers, but they are not at all essential to an under- 
standing of the main features of Charts 10 to 13, inclusive. Teachers 
who have no predilection for statistical refinements may forget that 
these lines are in the charts. The important thing is to keep in mind 
that the horizontal lines, in order, represent the class averages in terms 
of the original units of the tests, which may be read on the score-scale 
at the extreme left of each chart. 

Chart 9. — Graphs of the distributions of scores of RB, RC, and RD students of 
Spanish in the junior high schools of New York City, June, 1925. 

Chart 10. — Graphic comparison of median scores 

Without further explanation, let us compare the two schools whose 
identities in Chart 10 we have hidden under the numbers 5 and 6. Number 
5 had no 8A, 8B or 9B students; it had only students which it classified 
as 9A. But the heavy line rising from the 9A average-line shows that the 
average score of this 9A class is actually 127 — almost equal to the 9B 
city-wide average! In school Number 6 the light line dropping down 
from the 9B average-line shows that the students classified as 9B secured 
an average score of only 91 — almost average for 9A classes! In other 
words, No. 5 actually had a 9B class, but called it a 9A class; whereas 
No. 6 had no 9B class, but called what was actually a 9A class a 9B class. 
Nor did school No. 6 have any 9A class other than the group of students 
it called 9B; for the students which this school classified as 9A are actually 
only slightly above the 8B city-wide average. Here we have two classes 
called 9A, one of which (School No. 5) is actually an average 9B class, 
and the other (No. 6) is only slightly above average for 8B classes, — 
a whole year apart in actual achievement. 

As a further example of how the charts should be read, let us compare 
schools 8, 9, and 10. Notice that in these charts we have arranged the 
schools in an ascending order of 9B class averages, from left to right. 
Thus schools 8, 9, and 10 have 9B classes with approximately equal aver- 
ages, all of which are nearer to the 9A city-wide average than to the 9B 
city-wide average. But the three schools are alike in no other respect, 
in so far as class standards are concerned. School 9 classes are con- 
sistently about one semester below city-wide standards; indeed, the 8B 
class in this school is somewhat below the city-wide 8A average. In 
schools 8 and 10 the 8A and 9A classes are almost exactly at par; but the 
so-called 8B classes in these schools vary from the city-wide average 
about equally in opposite directions. 

In schools 18 and 20 the 8A and 9A classes vary in opposite directions. 
In 18 the 9A class average is below the 8B city-wide standard, while in 
school 20 the 9A class average is about halfway up to the 9B standard; 
and in 18 the 8A together with the 8B class is below par, while in 20 the 
8A is almost average for 8B classes. 

In Chart 11 the variation of class standards within a given school is 
not so great nor so frequent as in Chart 10; that is, if a given school is 
below city-wide standards for the RD class it is more likely to be below 
standard for RC and RB classes than above in one or both these classes; 
but we find the same tremendous variations between different schools. 
School No. 2 in Chart 11 had an aggregation of students which it classi- 
fied as RC, when actually they are considerably below the RB average, 
— indeed, they are slightly below the 8A city-wide average, although 
they have had two full semesters of work and were classified as rapid 
advancement pupils! School 7 had an RD class which at the end of 
three semesters of “rapid” advancement is just at par for RC pupils. 
School 9 has an RB class which is practically entirely below the lower 
quartile of 8A students! Probably 90% of the students in this RB class 
and in the RC class in school 2 should never have been allowed to study 
a modern language; yet they were not only allowed to try it, but were 
classified as rapid advancement pupils. 

Chart 12. — Graphic comparison of median scores of each “normal” Spanish class 
in each school. 

Charts 12 and 13 for the normal and rapid Spanish classes tell a simi- 
lar story. Some readers may be inclined to discount the variations dis- 
played, and to attribute the appearance of extreme variations to imperfec- 
tions in the measuring device used, and to other like causes; they will 
be impelled to seek these explanations by the feeling that such gross varia- 
tions are really impossible. We shall now present such evidence as we 
have concerning the reliability and validity of the tests used. In judging 
the force of our statistical evidence the reader must remember that the 
tests were constructed with the intimate coöperation of the Director of 
Foreign Languages in the Junior High Schools and of some of his teachers, 
and with the counsel and advice of a large number of his teachers. The 
tests are the product of four years of continuous experimenting with 
many kinds of tests. Both the content and form of each question in 
each test were approved by teachers who were actually teaching pupils 
who were to be tested. It is also to be noted that we have had no adverse 
criticism on the fairness of the tests from any teachers whose pupils 
were subjected to and rated by them. On the contrary, we have had 
many expressions of satisfaction. 

Chart 13. — Graphic comparison of median scores of each “rapid” Spanish class 
in each school. 

As to the possibility of such extreme variations, let us remember that 
no uniform examinations of any sort have ever been given below the 9B 
and RD classes. In the absence of some common point of reference and 
of some common unit of reckoning it would be nothing short of a miracle 
if such variations did not exist. No one would be surprised if we found 
such variations in the subjective judgments of teachers, or even of trained 
physicists, as to the temperatures of a series of days; but the estimation 
of temperature is a simple task as compared with the matter of judging 
a very complex phenomenon such as modern language achievement. 


III 
RELIABILITY AND VALIDITY OF TESTS 


RELIABILITY OF TESTS 


Definition of reliability. — By reliability of a test is meant the degree 
to which it is consistent with itself in measuring the same thing two or 
more times. The most direct way to find the reliability or consistency 
of a test is to administer two equivalent Forms of the test to the same 
students, and then calculate the correlation between the two sets of scores. 
But this procedure is not always feasible. When two equivalent Forms 
cannot be administered to the same students, the reliability of the test 
may be estimated from the reliability of one-half the test, that is, from 
the correlation between the two sets of scores obtained by “splitting” 
the one Form of the test that has been given to students. “Splitting” 
a test is accomplished by treating the odd-numbered questions as though 
they constituted a test separate and independent of the test constituted 
by the even-numbered items. Thus each student receives two scores on 
the Form that has been administered: one score is based entirely upon 
the odd-numbered items, and the other is based entirely upon the even- 
numbered questions in the examination. The correlation between these 
two sets of scores shows the consistency or reliability of one-half of the 
test; by using the Spearman-Brown Formula the reliability of the whole 
test may be very closely estimated from the reliability of one-half of it. 
This is what we have done with the junior high school French and Spanish 
tests. 
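
The procedure just described may be sketched in a few lines. The sketch assumes that each student's paper is recorded as a row of 1's (right) and 0's (wrong) in test order; the six rows below are hypothetical, and the function names are our own. The stepping-up of the half-test correlation uses the Spearman-Brown Formula for a test of double length.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(half_r):
    """Estimate whole-test reliability from the half-test correlation."""
    return 2 * half_r / (1 + half_r)


def split_half_reliability(item_matrix):
    """item_matrix: one row per student, 1 = right, 0 = wrong."""
    odd_scores = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    half_r = pearson_r(odd_scores, even_scores)
    return half_r, spearman_brown(half_r)


# Hypothetical papers: six students, eight items (illustrative only).
papers = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
half_r, whole_r = split_half_reliability(papers)
print(round(half_r, 3), round(whole_r, 3))
```

With real papers the matrix would have one row per examinee and one column per question; nothing else in the sketch changes.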

It is easy to see that if we had an ideal test a student who received a 
total score of 60 would receive a score of 30 on each random half of the 
test; a student who received a total score of 90 would receive a score of 
45 on each random half; and so with other total scores. But ideal tests 
do not exist, and our best course is to find out what type of test approaches 
most nearly to the ideal and to use it in preference to all others, remember- 
ing, of course, that reliability alone does not make a test ideal. A test 
must not only measure consistently and accurately, but it must measure 
the right thing. 

Reliability of old-type examinations. — In considering modern language 
tests it is not our privilege to choose between perfect and imperfect instruments: 
we can only try to avoid the less reliable and seek the more valid 
and more reliable forms of examinations. The old type of modern language 
examination is of such a nature that it is very difficult to ascertain 
its reliability, a fact which in itself argues against its having high re- 
liability. What evidence there is tends to show that the ordinary old- 
type examination of two or three hours length has an average reliability 
of about 0.70; it is almost certain that it is not above 0.80, and reaches 
0.80 only in rare cases, if at all. These figures may be compared with the 
reliability coefficients found for the new-type junior high school tests. 

Reliability of new-type tests. — The correlations between random halves 
of the new-type tests for four 500-pupil samplings range for the French 
from 0.905 to 0.961, and for the Spanish test from 0.882 to 0.953, with 
averages at about 0.95 and 0.92, respectively. The four samplings were 
chosen so as to represent all types of schools: the first five hundred papers 
were all taken from the schools with the lowest average score; the second 
sample was made up of examination returns from low-middle schools, 
the third from high-middle schools, and the fourth from the schools which 
averaged distinctly above the city-wide average. 

Treating the two thousand cases in each language as a single sampling 
we find that the correlation between random halves is nearly 0.94 for the 
French test, and a little over 0.93 for the Spanish test. From these figures 
we estimate by means of the Spearman-Brown Formula that the reliability 
of each total test is almost exactly 0.97. For a 90-minute examination 
this coefficient is very gratifying. It is only 0.17 above the estimated 
maximum reliability for old-type modern language examinations, but the 
error of estimate, expressed in terms of standard deviation, with a relia- 
bility of 0.97 is only about one-third the error of estimate with a reliability 
of 0.80. 

The correlations between random halves of each of the Parts of each 
test, and the Spearman-Brown estimates of total reliabilities of each of 
the Parts, and of the whole, of each of the two tests, are shown in Table 6. 
These coefficients may be accepted with confidence in view of the fact 
that each is based upon a random sampling of 2000 test returns. 


TABLE 6 


RELIABILITIES OF FRENCH AND SPANISH JUNIOR HIGH SCHOOL TESTS. THE NUMBER 
OF CASES FOR EACH CORRELATION IS 2,000. 


FRENCH SPANISH 


Desirability of 500-question tests. — The close correspondence between 
the reliability findings for the two tests is an interesting fact in itself, 
and carries certain implications which seem worthy of note. In the first 
place, it argues that the total reliability coefficients of both tests have 
been determined with high accuracy. In the second place, in view of 
the absolute parallelism in the derivations of the two tests, it indicates 
both that these forms of questions can be depended upon when similarly 
used to give very good results in other languages, and that with these 
forms of questions, used as here used, we have reached the upper limit 
of reliability when the total number of separate items is only 220 or 225. 
The importance of this last indication is that it compels us to make and 
use longer tests than we have been accustomed to thinking of as generous 
in length and in sampling of materials and of performances. These indi- 
cations have been more than confirmed in similar experiments in other 
subject matters. In modern language measurements we must become 
accustomed to tests of 350 to 500 separate items of the sorts used in these 
tests, and in about equal proportions. Tests of 350 to 500 items can 
easily be administered in 120 to 150 minutes, and the gain in reliability 
and significance will more than repay the few extra minutes used. 
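
The gain to be expected from a longer test can be roughly gauged with the general form of the Spearman-Brown Formula, which estimates the reliability of a test made k times as long. The sketch below is an illustration only: it assumes that the added items are equivalent in kind and quality to those already in the test, it takes 220 items and a reliability of 0.97 as the starting point, and the function name is our own. The text above does not itself report this calculation.

```python
def lengthened_reliability(r, k):
    """General Spearman-Brown estimate for a test made k times as long,
    assuming the added material is equivalent to the original."""
    return k * r / (1 + (k - 1) * r)


current_items = 220          # approximate length of the present tests
current_reliability = 0.97   # whole-test reliability reported above

for new_items in (350, 500):
    k = new_items / current_items
    estimate = lengthened_reliability(current_reliability, k)
    print(new_items, round(estimate, 3))
```

With k = 2 the general formula reduces to the familiar half-test form, 2r / (1 + r).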

Reliability of each Part of the new-type tests. — The correlations in 
Table 7 were calculated so as to secure as many bits of independent evidence 
on the reliability of the tests as possible. The variations between 
the corresponding correlations from the four samplings are, on the whole, 
slight, and strongly confirm the general results reported above in Table 6. 
However, we shall consider some of the differences in detail, partly to 
show how closely the correlations of Table 7 confirm those of Table 6, 
and partly to take the opportunity which these apparent differences 
offer of showing the reader concretely that correlation coefficients cannot 
always be accepted at face value. 


TABLE 7 

RELIABILITY COEFFICIENTS OF THE PARTS AND OF THE WHOLES OF THE JUNIOR 
HIGH SCHOOL TESTS, BASED ON FOUR 500-PUPIL SAMPLINGS. EACH OF THESE FOUR 
SAMPLINGS CONTAINS EQUAL NUMBERS OF 8A, 8B, 9A, AND 9B STUDENTS. THE FOUR 
SAMPLINGS REPRESENT, RESPECTIVELY, SCHOOLS WHICH TEST FAR BELOW, A LITTLE 
BELOW, A LITTLE ABOVE, AND FAR ABOVE THE CITY-WIDE AVERAGE FOR ALL CLASSES. 

Types of Schools   Low             Middle-Low      Middle-High     High            Averages
Cases              500             500             500             500             4 x 500

                   r       S-B r   r       S-B r   r       S-B r   r       S-B r   r       S-B r

French Test
Part I             .852    .919    .841    .913    .895    .945    .847    .917    .859    .924
Part II            .908    .952    .864    .927    .944    .972    .885    .989    .925    .962
Part III           .875    .933    .869    .930    .902    .948    .869    .930    .879    .936
I+II+III           .952    .975    .982    .965    .943    .971    .961    .980    .945    .972

Spanish Test
Part I             .822    .902    .831    .908    .830    .908    .843    .915    .882    .908
Part II            illeg.  illeg.  illeg.  illeg.  illeg.  illeg.  illeg.  .944    illeg.  .928
Part III           .819    .901    .916    .956    .918    .957    .929    .964    .895    .945
I+II+III           .883    .938    .902    .949    .912    .954    .953    .966    .913    .955

Standard and probable errors of estimate. — The magnitude of a given 
coefficient of correlation depends, among other things, upon the range of 
talent, in other words, upon the spread or variability of the scores con- 
cerned in its calculation. The real agreement between two sets of scores 
for each of two classes may be exactly equal; but if the whole range of 
the talent in question is represented in one class, and only a part in the 
other, the correlation coefficient based on the first class will be larger 
than that based on data from the second class. In order to learn the 
true meaning of two or more correlations based upon scores of a given 
test, we must know, among other things, the spread or variabilities of the 
groups involved in the various correlations which are being compared. 
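
The dependence of a correlation upon the range of talent can be shown with a small simulation. The sketch below is purely illustrative: the scores are artificial, generated from an assumed "true ability" plus independent errors, and the same relation between the two half-scores holds for every simulated student. Only the spread of the group changes.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


random.seed(1927)  # fixed seed so the illustration is repeatable

# Artificial data: each half-score is true ability plus its own error.
ability = [random.gauss(0, 1) for _ in range(5000)]
half_a = [t + random.gauss(0, 0.4) for t in ability]
half_b = [t + random.gauss(0, 0.4) for t in ability]

# Whole range of talent.
full_r = pearson_r(half_a, half_b)

# Restricted range: keep only students near the middle of the ability scale.
kept = [(a, b) for t, a, b in zip(ability, half_a, half_b) if abs(t) < 0.5]
narrow_r = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(round(full_r, 2), round(narrow_r, 2))
```

The errors of measurement are the same in both groups; the coefficient falls in the restricted group only because the spread of ability is smaller.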

A convenient way of comparing two or more such correlations is to express 
them all in terms of the same unit, that is, in terms of the units of the 
test. This is done by using the well-known formula, $\sigma_{1.2} = \sigma_1 \sqrt{1 - r_{12}^2}$, 
on each of the correlations which are being compared, with their corresponding 
appropriate sigmas. Thus, in Table 7, we find that the reliability 
of the whole French test is 0.952 according to one sampling of 500 
students (low-average schools) and is 0.961 according to the data from 
500 students in high-average schools. Substituting the appropriate values 
in the formula, we find that $\sigma_{1.2}$, or the standard error of estimate, according 
to the first sampling is 5.91, and according to the second it is 
6.66, which leaves us with a very slight and negligible difference. The 
corresponding correlations for the Spanish test, 0.883 and 0.953, similarly 
reduce to standard errors of estimate of 6.33 and 5.75 raw score-units. 
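
The arithmetic of this conversion may be sketched as follows, assuming the standard error of estimate is the sigma of the scores multiplied by the square root of one minus the squared correlation. The sigmas of the samplings are not given in this passage; the sketch therefore works backward from the reported correlations and standard errors to the sigmas they imply. The function names and the implied sigmas are our own illustration, not figures taken from the survey tables.

```python
from math import sqrt


def standard_error_of_estimate(sigma, r):
    """Standard error of estimate in raw score-units: sigma * sqrt(1 - r^2)."""
    return sigma * sqrt(1 - r ** 2)


def implied_sigma(se_estimate, r):
    """Sigma that would yield the given standard error at correlation r."""
    return se_estimate / sqrt(1 - r ** 2)


# (label, reported correlation, reported standard error of estimate)
reported = [
    ("French, low-average schools", 0.952, 5.91),
    ("French, high-average schools", 0.961, 6.66),
    ("Spanish, first sampling", 0.883, 6.33),
    ("Spanish, second sampling", 0.953, 5.75),
]

for label, r, se in reported:
    sigma = implied_sigma(se, r)
    # Round-trip check: the implied sigma reproduces the reported error.
    assert abs(standard_error_of_estimate(sigma, r) - se) < 1e-9
    print(label, round(sigma, 1))
```

A larger implied sigma goes with the higher coefficient in each pair, which is the point of the paragraph: the coefficients differ chiefly because the groups differ in spread.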

By this method, correlations based upon data from the same or equiva- 
lent tests, in which the scale units are equal or very nearly equal through- 
out the whole range of the scores involved, can be made comparable 
because they can be expressed in identical units. But correlations based 
upon data from different tests cannot be compared unless the relations 
of the units of the compared tests are known, or unless the compared 
tests have been administered to the same, or to an equally variable, group 
of examinees. Some writers have erroneously supposed that the relia- 
bility coefficients of any two tests could be compared if both were ex- 
pressed as standard errors of estimate. The essence of comparability is 
that the quantities compared be expressed in identical units: correlation 
coefficients are not directly comparable unless they are based upon data 
from the same or equally variable groups of examinees; standard (or 
probable) errors of estimate are not directly or immediately comparable 
unless they are expressed in terms of the same or equal scale units. 


VALIDITY OF TESTS 


Since no other modern language test was given to the students involved 
in this survey, it is not possible to check the validity of the whole test 
against an independent criterion of achievement in modern language 
work. There is, however, little need for verifying the validities of Parts 
I and III; vocabulary and grammar tests of this sort have been so thor- 
oughly tried out and proved that no scientist or teacher who has kept 
pace with recent developments can doubt their qualities. This form of 
vocabulary test began to be used about five years ago, and is now ex- 
tensively used by modern language teachers all over the country. The 
grammar-completion form has been in successful use much longer; it has 
been considered an indispensable part of modern language examination 
technique by many teachers for many years. The particular form of 
grammar-completion test used in this survey retains all the values of 
older forms, and adds some values and conveniences which older types 
of the completion method did not possess. 

Intercorrelations of Parts of new-type tests. — The only real innovation 
in these tests is the form of the reading-comprehension Part. As an indi- 
cation of its validity we have calculated the correlations between Part II 
scores on the one hand, and Part I and Part III scores on the other; 
and as a sort of “control,” we have also found the relationship between 
Part I and III scores. Part II of the French test gives correlations of 0.88 
and 0.80 with Parts I and III, respectively; and Part II of the Spanish 
test gives corresponding correlations of 0.84 and 0.74. The “control” 
correlations between Parts I and III are 0.80 and 0.728 for the French 
and Spanish tests, respectively. The number of cases on which each of 
these coefficients is based is 2000. If we accept the validity of Parts 
I and III, we must accept that of Part II. On the whole, the relationships 
are high enough to vindicate Part II for general measurement purposes, 
and low enough to show that it is not a mere duplication of Parts I 
and III. 

Table 8 shows intercorrelations between the Parts of the French and 
Spanish tests based upon four five-hundred-pupil samplings as well as 
upon the aggregate of 2000. The differences between the results from 
the four samplings are negligible; they are due to slight differences in the 
variabilities of the four groups and to the fact that some students in some 
schools were not compelled or encouraged to distribute their time equably 
between the three Parts. 


TABLE 8 

INTERCORRELATIONS OF THE PARTS OF THE JUNIOR HIGH SCHOOL TESTS BASED 
UPON FOUR 500-PUPIL SAMPLINGS AND UPON THE AGGREGATE OF THESE FOUR 
SAMPLINGS. EACH OF THE FOUR SAMPLINGS CONSISTS OF EQUAL NUMBERS OF 8A, 8B, 9A, 
AND 9B STUDENTS. THE FOUR SAMPLES WERE MADE UP SO AS TO REPRESENT RESPECTIVELY 
SCHOOLS WHICH TESTED FAR BELOW, A LITTLE BELOW, A LITTLE ABOVE, AND FAR ABOVE 
THE CITY-WIDE AVERAGE, IN ALL FOUR CLASSES. 

Types of Schools     Low             Middle-Low      Middle-High     High            Unselected
N                    500             500             500             500             2000

French Test
Parts Correlated     2       3       2       3       2       3       2       3       2       3
1                    .870    .813    .876    .765    .894    .823    .876    .883    .883    .810
2                            .803            .796            .854            .812            .801

Spanish Test
Parts Correlated     2       3       2       3       2       3       2       3       2       3
1                    .78     .623    .842    .723    .85     .128    .828    .830    .84     illeg.
2                            .61             .774            .702            .800            .728

IV 


ANALYSIS OF INDIVIDUAL QUESTIONS IN NEW-TYPE TESTS 


DIFFICULTIES OF INDIVIDUAL QUESTIONS 


As was indicated on an earlier page, the materials for these tests were 
selected carefully so as to put preponderant emphasis on the fundamentals 
of vocabulary, grammar and idiom which are most nearly common to all 
courses of study, and at the same time provide an adequate range of 
difficulty, by including some questions easy enough to differentiate among 
the most backward students at the end of the first semester, and some 
difficult enough to differentiate among the best students at the end 
of the fourth semester. That the tests are adequate for the range of 
talent represented by junior high school students has been shown by the 
distributions of total scores in Tables 3 and 4. 

Individual questions experimentally validated. — This adequacy was 
not achieved by the original selection alone. Large numbers of questions 
of each type were constructed on the basis of word-counts of widely-used 
textbooks, word lists in several syllabi, and syntheses of the idioms and 
grammatical elements in several first-year texts; then these questions were 
administered to an aggregate of several thousand students of French and 
Spanish in New York City junior and senior high schools. The absolute 
and relative difficulty and goodness of each individual question were 
determined upon the basis of experimental results. Many of the ques- 
tions were eliminated for a variety of reasons; the remaining questions 
of each kind were then arranged in order of difficulty and divided into 
two equivalent Forms, A and B. Form A of the French and of the 
Spanish tests are reproduced below (page 48) approximately in the form 
in which they were administered to more than 25,000 junior high school 
students in June, 1925. 
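
The steps of this item analysis may be sketched as follows. The sketch computes the per cent of correct answers for each question, arranges the questions from easiest to hardest, and deals them alternately into two Forms. The alternate dealing is an assumption made for illustration; the text does not say by what rule the questions were divided between Forms A and B. The per cents shown are hypothetical.

```python
def percent_correct(responses):
    """responses: one row per student, 1 = right, 0 = wrong.
    Returns the per cent of correct answers for each question."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        100 * sum(row[i] for row in responses) / n_students
        for i in range(n_items)
    ]


def split_into_forms(difficulties):
    """Order question indices from easiest to hardest, then deal them
    alternately into two Forms (an assumed rule, for illustration)."""
    order = sorted(
        range(len(difficulties)),
        key=lambda i: difficulties[i],
        reverse=True,  # highest per cent correct (easiest) first
    )
    return order[0::2], order[1::2]


# Hypothetical per cents of correct answers for six trial questions.
trial_difficulties = [90, 35, 72, 10, 55, 48]
form_a, form_b = split_into_forms(trial_difficulties)
print(form_a, form_b)
```

Dealt in this way, each Form receives questions from every level of difficulty, so that the two Forms climb the same ladder by nearly the same steps.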

Table 9 shows how well we succeeded, by the end of the preliminary 
experimentation, in securing adequate gradations of difficulty in the items 
of the two tests. Other things being equal, that test is best in which 
the items range from very easy to very difficult and in which the incre- 
ments of difficulty are small, regular, and numerous, thus providing a 
scale without gaps, like a well-built ladder, up which the examinee may 
climb to his highest level without irrelevant hindrances and distractions. 
Table 9 shows that the French and Spanish tests with which we are concerned 
approximate a well-built language ladder. There is at least one 
question in each test that was answered correctly by 90% or more of the 
students, and there were two questions in each test that were answered 
correctly by not more than 1% of the students. About 14% of all the 
questions were answered correctly by less than 10% of the students. As 
was fully expected, Part III in both tests is considerably more difficult 
than the other Parts. Only one Part III question in each test was answered 
correctly by more than 60% of the students. But this does not 
vitiate the validity and reliability of Part III: the important thing is 
that a test be smoothly scaled in difficulty, and while there are no very 
easy items in Part III, they are well-distributed from “60%-correct” 
degree of difficulty down to less than “1%-correct-answers” degree of 
difficulty. In this connection, we must also remember that writing is not 
emphasized as much in the first two semesters as in the last two semesters. 


TABLE 9 

DISTRIBUTIONS OF QUESTIONS IN EACH PART OF THE FRENCH AND SPANISH TESTS 
ACCORDING TO PER CENTS OF CORRECT ANSWERS FROM 400 STUDENTS (100 FROM EACH 
OF THE 4 NORMAL CLASSES, BUT OTHERWISE UNSELECTED). THE FIRST ENTRY IN COLUMN 
ONE AT THE LEFT SHOWS THAT IN PART I OF THE FRENCH TEST THERE WERE 4 
VOCABULARY ITEMS ANSWERED CORRECTLY BY 80 TO 90% OF THE STUDENTS; THE LAST 
ENTRY IN THAT SAME COLUMN SHOWS THAT THERE WERE 3 VOCABULARY ITEMS CORRECTLY 
ANSWERED BY LESS THAN 5%, BUT NOT ONE ITEM ANSWERED BY LESS THAN 
1.5% OF THE STUDENTS. 

NUMBERS OF QUESTIONS ANSWERED CORRECTLY BY INDICATED PROPORTIONS 
OF UNSELECTED STUDENTS 

PER CENT OF          French Test                      Spanish Test
CORRECT ANSWERS      Part I   Part II   Part III     Part I   Part II   Part III

90-100%                       1                      1
80-89.9%             4        2                      6
70-79.9%             7        2                      13       2         1
60-69.9%             9        5         1            7        3
50-59.9%             13       8         8            8        6         1
40-49.9%             14       15        illeg.       14       15        5
30-39.9%             18       11        2            illeg.   12        illeg.
20-29.9%             13       11        14           10       10        11
10-19.9%             12       5         10           26       10        12
5-9.9%               7                  9            3        2         16
1.5-4.9%             3                  6            1                  9
.5-1.49%             2 [column uncertain]
0-0.5%               1 [column uncertain]
Totals               100      60        60           100      60        65

Order of difficulty correlations. — The items of each Part in each test 
were arranged in orders of difficulty as determined by the per cents of 
correct responses given by 100 8A, by 100 8B, by 100 9A, and by 100 9B 
students. The correlations between two orders of difficulty based upon 
responses of contiguous classes, e.g. 100 8A and 100 8B students’ responses, 
are consistently higher than the correlations between orders of difficulty 
based upon responses of e.g. 100 8A and 100 9B students. The correla- 
tions between orders of difficulty when each order is based upon responses 
of 100 students range from 0.60 to 0.95, with an average of about 0.75 
for Parts I and II, and about 0.85 for Part III, in both tests. When 
orders of difficulty are based upon responses of 200 students, the correla- 
tions range from 0.87 to 0.96, with an average at about 0.92 for both 
tests. 
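
A correlation between two orders of difficulty can be computed from the ranks alone. The sketch below uses the rank-difference formula, rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), which holds when there are no tied ranks. Whether the coefficients reported above were obtained by this formula or by another is not stated, so the choice is an assumption; the per cents for the two groups are hypothetical.

```python
def ranks(values):
    """Rank questions from easiest (rank 1) to hardest by per cent correct.
    Assumes no two questions have the same per cent."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    rank = [0] * len(values)
    for position, item in enumerate(order, start=1):
        rank[item] = position
    return rank


def rank_correlation(pct_group_1, pct_group_2):
    """Rank-difference correlation between two orders of difficulty."""
    r1, r2 = ranks(pct_group_1), ranks(pct_group_2)
    n = len(r1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical per cents correct on five questions for two classes.
pct_8a = [80, 60, 45, 30, 10]
pct_8b = [90, 65, 70, 40, 25]
rho = rank_correlation(pct_8a, pct_8b)
print(round(rho, 2))
```

In the example only the second and third questions exchange places between the two classes, and the coefficient falls short of unity by that amount.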

If these relationships are confirmed by more extensive investigations, 
they indicate that modern language learning is highly integrated, or very 
homogeneous, from the testing viewpoint; the neural bonds that consti- 
tute effective language ability are highly interrelated and interdependent. 
But this evidence may be partially or entirely contradicted by correlations 
between orders of difficulty based upon responses of two groups of students 
using different syllabi. On the other hand, it may be that all learning 
of a given language, regardless of differences between texts and syllabi 
and methods of instruction used, is characterized in its fundamentals by 
a fairly large and unvarying core of common neural bonds. In order to 
enable other users of these tests to make comparisons with orders of diffi- 
culty based upon responses of New York City junior high school students, 
the absolute and relative difficulties of each item in both tests are indi- 
cated at the left of each item in columns 1, 2, and 3, on pp. 48 to 85. 

Reproduction of the tests. — With the exception of the five¹ columns of 
figures at the left, which will be explained shortly, and excepting the size 
of the type and spacing of the lines, the tests are reproduced below ex- 
actly as they were administered to New York City junior high school 
students in June, 1925. Misprints and other errors are faithfully repro- 
duced so that the reader may see exactly what was set before the students. 
As given to the students, the questions were in 10-point type throughout, 
with double leads between lines of type in the same question and extra 
leads between questions. The lines were four inches long, and the whole 
test was printed in a 12-page booklet, 814 by 1014. In Part III lines on 
which the students wrote their responses were 134 inches in length, and 
spaced at least 3¢-inch apart vertically. These tests have been pub- 
lished, and may be secured in quantity, with manuals of directions and 
norms, from the World Book Co., Yonkers, N. Y. They are called the 
American Council Beta French and Spanish Tests. 


1 Seven for Part I of the French test. The columns of figures at the left of the test questions will be 
discussed in detail on pages 86 ff. The tests are reproduced on the following pages: 
French: Part I, 48-54; Part II, 55-61; Part III, 62-66. 
Spanish: Part I, 67-72; Part II, 73-79; Part III, 80-85 
New York City Junior High Schools


COMPREHENSIVE SPANISH EXAMINATION: Form A


School ........................................ Date, June ....., 1925

Name ..............................................................
(First Name) (Second) (Last)

Age ..... years ..... months. Boy ..... Girl ..... (check one)


I am just completing Spanish 8A, 8B, 9A, 9B, RA, RB, RC, RD (circle one)


Name of teacher ...................................................


Directions to Students: You will be allowed thirty minutes on each of the three
parts of this examination. When you are told to start, turn this page, read the directions
for Part I carefully, and answer all the questions that you can in Part I in the time al-
lowed. When the teacher says, “Begin Part II,” turn to Part II even if you have not
finished Part I. You are not supposed to be able to answer every question; answer only 
those that you know and don’t worry or waste time on the others. If you finish Part I 
before time is called, look over your work and correct any mistakes you may have made. 


Score 
Part I 
Part II 
Part III 


Total 


Grade 


Copyrighted: Reprinted by Permission
The first column of figures at the left of the test-questions shows the
per cent of students that answered each question correctly. These per
cents are based upon the responses of 400 students, 100 from each of
the four normal classes. They constitute indices of the absolute diffi-
culties of the items, and may be accepted as fairly reliable, since we
learned above that the reliability of difficulty indices based upon only 
two hundred cases is about 0.92. Thus we learn (column 1, page 49) 
that paysan is understood by 78.8% of the 400 students whose responses 
were studied in detail, to mean peasant, and not paint, country, plow, 
or safety. The second French vocabulary item, appeler, was answered 
correctly by 38.8% of the students. 

The second column gives the order of difficulty of the items within each 
Part. Thus we learn that paysan is the fifth easiest item in the 100-item 
vocabulary test; that appeler is the fiftieth, écouter the fourth, courir the 
thirty-eighth, and that argent is the easiest of the 100 vocabulary items. 
Similarly we learn, from column 2, that La capitale de la France . . . is the 
easiest item in Part II, and donnez-moi the easiest item in Part III of 
the French test. The most difficult vocabulary item is No. 93, renoncer; 
only 3% of the students answered it correctly. These few answered 
correctly in spite of the fact that the word abjure was printed as adjure. 
The most difficult item in Part II is No. 57, and in Part III No. 57, also. 

The third column shows the order of difficulty of the items when all 
three Parts are merged. Thus paysan is the eighth easiest item in the 220 
questions that make up the French test; La capitale de la France... , in 
Part II, is the easiest of all the 220 questions in the French test. The 
most difficult of the 220 hurdles of the French test is No. 57 in Part III. 
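The derivation of columns 2 and 3 from column 1 may be sketched as follows. This is a minimal illustration, assuming that rank 1 denotes the easiest item and that ranks are assigned by sorting on the per cent correct. The function name and most of the sample figures are hypothetical; only 78.8 and 38.8 are the per cents quoted above for paysan and appeler.

```python
def difficulty_orders(pct_correct_by_part):
    """Return (order within each Part, order with all Parts merged).

    pct_correct_by_part maps a Part label to {item number: per cent correct}.
    Rank 1 is the easiest item, that is, the highest per cent correct.
    """
    within = {}
    pool = []
    for part, items in pct_correct_by_part.items():
        ordered = sorted(items, key=lambda no: -items[no])
        within[part] = {no: rank for rank, no in enumerate(ordered, start=1)}
        pool.extend((part, no, pct) for no, pct in items.items())
    pool.sort(key=lambda entry: -entry[2])
    merged = {(part, no): rank for rank, (part, no, _) in enumerate(pool, start=1)}
    return within, merged


# Items 1 and 2 of Part I carry the per cents quoted in the text;
# every other figure here is invented for the illustration.
sample = {
    "I": {1: 78.8, 2: 38.8, 3: 80.0},
    "II": {1: 95.0, 2: 20.0},
}
within, merged = difficulty_orders(sample)
print(within["I"])        # order of difficulty inside Part I
print(merged[("II", 1)])  # position of Part II, item 1, among all items
```

Ties are left in their original sequence by the sort; a fuller treatment would have to decide how tied items share a rank.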

The reader may secure some amusement, and perhaps profit, by in- 
dulging in a little guessing game as to the relative difficulties of any two 
contiguous items in these two tests. He will have a concrete illustra- 
tion of the difficulty of constructing examinations ‘‘by guess,” as old-type 
examinations have necessarily been constructed in the past. 

Columns 4 and 5 will be explained in connection with the discussion of 
“good” and “bad” questions, in the following paragraphs. 

Distribution of Henmon and Wood frequencies of words in French 
vocabulary test. — Columns 6 and 7 for Part I of the French test were 
put in as an additional check on the difficulty and validity of our vocabu- 
lary test items. Column 6 shows the number of times each word oc- 
curred in 400,000 words of running discourse, according to the Henmon 
French Word Book (q.v.).¹ Column 7 shows the number of textbooks ²
in which each word appeared. Table 10 summarizes the facts of columns 


1 University of Wisconsin Research Bulletin No. 3, Sept., 1924. See above, pp. 49-54 


2 Wood, A Comparative Study of the Vocabularies of Sixteen French Textbooks,
The Modern Language Journal, February, 1927.




6 and 7, and shows that although every word in our French vocabulary 
test occurs in four or more standard and widely used French textbooks, 
fifteen of them do not occur in the Henmon list. This means that 85 
of the 100 words in Part I occur five times or oftener in 400,000 words of 
running discourse; and that 15 occur less than five times. 


TABLE 10 


DISTRIBUTION OF FRENCH WORDS IN PART I OF THE JUNIOR HIGH SCHOOL FRENCH
TEST ACCORDING TO HENMON FREQUENCIES AND ACCORDING TO NUMBER OF FRENCH
TEXTBOOKS IN WHICH EACH WORD OCCURS


HENMON FREQUENCY:     NUMBER OF        WOOD FREQUENCY:       NUMBER OF
OCCURRENCES IN        FRENCH WORDS     NUMBER OF TEXTBOOKS   FRENCH WORDS
400,000 WORDS

0                        15                   4                  10
5-9                      10                   5                  11
10-19                    17                   6                  11
20-29                    [?]                  7                  [?]
30-39                     6                   8                   8
40-49                     2                   9                  [?]
50-59                     7                  10                   1
60-69                     3                  11                   1
70-79                     2                  12                   9
80-89                     4                  13                   9
90-99                     1                  14                  10
100-119                   4                  15                   9
120-139                   3                  16                   1
140-159                   2
160-179                   3
180-199
200-299                   2
300-399                   3
400-499                   2
500 +                     3


Intercorrelations of Henmon and Wood frequencies and empirical 
difficulties of words in French vocabulary test. — The correlation of the 
empirical order of difficulty (column 2, pp. 49-54) of these 100 French 
words with the Henmon frequencies of column 6 is 0.278; and with the 
Wood frequencies of column 7 is 0.336. The correlation between col- 
umns 6 and 7 is 0.832. These correlations have considerable significance 
for the content and organization of the curriculum as well as for the 
theory and practice of modern language testing. It is clear that under 
present conditions students learn many words which occur only once or 
twice in 400,000 words of discourse better than they learn other words 
which will be met with much oftener in general reading. 


VALIDITY OF INDIVIDUAL QUESTIONS


Definition of a “good” question. — On the basis of the responses of the 
400 cases used in establishing the order of difficulty discussed above, 
each of the 220 or 225 items in the French and Spanish tests has been
checked up for validity against two external criteria. The first criterion 
was the class-placement, 8A, 8B, 9A, or 9B, of the students; the second 
was the total score on the French or Spanish test. The 400 cases were 
divided into four groups of 100 each, Group I being composed of the 100 
who secured the lowest scores, regardless of what class they were in, and 
Group II the next 100 lowest scores, ...and Group IV of those who se- 
cured the highest 100 scores in the sample of 400 cases. A question is 
considered valid or “good” if it consistently differentiates between groups
of students which are known to be arranged in order of merit, that is,
if it gives progressively higher percentages of correct answers to the four
groups just described, from 8A to 9B, or from I to IV, in order. A question
is considered “bad” if it does not favor the better groups, and is consid-
ered very bad if it actually gives more credits to the 8A or Group I stu-
dents than it gives to the 9B or Group IV students.

Possible causes of “inverted” questions. — Those who have not made
exact empirical studies of examination results may be inclined to doubt 
that a reasonable question could be constructed which would give higher 
scores to poor students than to good students; but such questions occur 
frequently in all types of examinations. The causes of these “inversions” 
are complex, and vary from question to question. We shall in following 
pages make some suggestions as to the probable causes of the few inver- 
sions which we have found in these tests; the only generalization that 
seems possible is that the inversions may be genuine or false: genuine, if 
the advanced students have really forgotten something which they learned 
very well in an earlier semester; false, if the low scores of the advanced 
students are due to some subtle ambiguity in the question itself, in the 
directions which have been given to the students, or in some prepotent 
counter-suggestion in the setting of the question which takes effect with 
the bright students but passes unnoticed by the dull ones. There can 
be no doubt that advanced students do forget some things they have 
learned well in early semesters, because of inadequate articulation of 
lower with higher class work, and because of the efforts of teachers to
follow too mechanically an artificially segmented syllabus. 

“Inversions” should be detected and eliminated. — Whatever the cause 
of “inverted” questions may be, “inversions” should be detected and 
eliminated from all examinations the results of which are likely to affect 
the education of any number of students. It is due to the presence of 
such questions in old-type examinations, among other unfortunate fea- 
tures, that old-type examinations afford such unsatisfactory reliability 
and validity coefficients. The purpose of an examination is to separate 
the sheep from the goats; and this is the purpose of each question in an 
examination. A question which gives equal credits to all students, low
and high, does not serve the purpose for which the examination was
made and administered; but it does no particular harm, aside from wasting 
the time and energy of both students and teachers. If, however, a ques- 
tion actually gives higher scores to poor students than to good students, 
its influence runs counter to the purpose of the examination; it pulls the 
sheep downward, and pushes the goats upward. Inversions cannot be 
detected without actual trial, and no examination should be accepted for 
use on a city- or state-wide basis which has not stood the test of experi- 
mental verification. 


TABLE 11 
Validities of Individual Questions in French and in Spanish Tests 


SHOWING THE NUMBERS OF ITEMS IN EACH PART OF EACH TEST OF EACH DEGREE
OF “GOODNESS” OR “BADNESS,” AS DEFINED IN THE TEXT. THE FIRST COLUMN, HEADED
IV-I, SHOWS THE INDICES OF “GOODNESS” IN TERMS OF THE DIFFERENCE BETWEEN
“PER CENT OF CORRECT RESPONSES BY GROUP IV STUDENTS” AND “PER CENT OF CORRECT
RESPONSES BY GROUP I STUDENTS.”


                      French Test                   Spanish Test
IV% - I%         Part I  Part II  Part III     Part I  Part II  Part III
80-90               3       4        1                              3
70-79.9             8      11        6                     1        4
60-69.9            17      15       [7]                    9        7
50-59.9            16     [11]      12            3       14       [7]
40-49.9             5      10        9            9       11        8
30-39.9            11       3        8            6       10        6
20-29.9            14       3        4          [22]       9       11
10-19.9            15       2        8           27        6       12
5-9.9               2                1           11                 3
0-4.9               4       1       [4]          12                 4
-0 to -4.9          3                             3
-5 to -9.9          2                             4
-10 to -19.9                                      2
-20 to -29.9                                      1


As a rough measure of the goodness or badness of individual questions 
we have compared the percentages of 9B (or Group IV) students with the 
percentages of 8A (or Group I) students that answered each question cor- 
rectly. If a larger proportion of the 9B students answer a given question 
correctly than of 8A students, the question is good; and its degree of 
goodness is roughly indicated by the amount of the difference. Column 
4 shows for a given question in each test the difference obtained by sub- 
tracting the per cent of Group I students answering it correctly from the 
per cent of Group IV students answering it correctly. (See above, pp. 
49-85. The 9B-8A differences were found to be so closely parallel to 
the IV-I differences that only the latter need be presented.) If a figure 
in column 4 is positive, the question to which it applies is a good one, 
for it means that good students answered it correctly more often than
poor students; if a figure is preceded by a minus sign (—), the question 
is a bad one, because poor students answered it correctly more often than 
good ones. Indifferent questions are those which give only small posi- 
tive differences in favor of the advanced group of students. A summary 
of the facts of column 4 is presented in Table 11. 
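The computation behind column 4 can be sketched as follows, under the assumptions that each response is recorded as 1 (correct) or 0 (incorrect) and that the four groups are formed by sorting students on total score and cutting the list into equal quarters. The function names and the eight-student sample are hypothetical; the study itself used 400 cases.

```python
def quarter_groups(total_scores):
    """Split student indices into Groups I to IV, lowest total scores first.

    Any remainder left after division by four is dropped; with 400 cases
    the split is exact.
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    size = len(order) // 4
    return [order[k * size:(k + 1) * size] for k in range(4)]


def pct_correct(item_responses, group):
    """Per cent of the students in `group` who answered the item correctly."""
    return 100.0 * sum(item_responses[i] for i in group) / len(group)


def iv_minus_i(item_responses, total_scores):
    """Group IV per cent correct minus Group I per cent correct."""
    groups = quarter_groups(total_scores)
    return pct_correct(item_responses, groups[3]) - pct_correct(item_responses, groups[0])


# Hypothetical data: eight students, two items.
scores = [10, 20, 30, 40, 50, 60, 70, 80]
good_item = [0, 1, 0, 1, 1, 1, 1, 1]
inverted_item = [1, 1, 0, 0, 1, 0, 0, 1]
print(iv_minus_i(good_item, scores))      # prints 50.0
print(iv_minus_i(inverted_item, scores))  # prints -50.0
```

A positive result marks a question that favors the better students; a negative result marks an inverted one.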

Proportions of “good” and “bad” questions in the tests. — From the 
first entry in the column headed French Part I, in Table 11, we learn 
that there were three French vocabulary items which were answered 
correctly by an excess of Group IV over Group I students of between 
80 and 90 per cent. The second entry shows that there were eight vo- 
cabulary items for which the excess of correct answers in Group IV over 
correct answers in Group I was between 70 and 80 per cent. In the whole 
French test there are only 17 items for which the excess of Group IV cor- 
rect responses over Group I correct responses is less than 10%; in other 
words, 203, or over 92% of the 220 items in the French test, wield effec- 
tive differentiating power, and only 5, or 2.3%, of the 220 items differenti- 
ate in the wrong direction, and these 5 items are only slightly inverted. 
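The proportions just quoted follow directly from the counts given for the French test, as this short check shows. The variable and function names are hypothetical; the three counts are those stated in the paragraph above.

```python
def share(count, total):
    """Per cent of `total` represented by `count`, to one decimal place."""
    return round(100 * count / total, 1)


TOTAL_ITEMS = 220     # items in the French test
WEAK_ITEMS = 17       # IV-I difference below 10 per cent
INVERTED_ITEMS = 5    # IV-I difference negative

effective = TOTAL_ITEMS - WEAK_ITEMS
print(effective, share(effective, TOTAL_ITEMS), share(INVERTED_ITEMS, TOTAL_ITEMS))
# prints 203 92.3 2.3
```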

Practically the same proportions hold for the Spanish test in so far as 
Parts II and III are concerned; the Spanish vocabulary test is less for- 
tunately constructed. No Spanish vocabulary item yields a difference in 
favor of Group IV above 60%, and only 67 of the vocabulary items wield 
significant differentiating influence, and 10 wield at least a slight influence 
in the wrong direction. On the whole, the difference between the French 
and Spanish tests here displayed is not very great; Table 11 indicates 
that the items of both tests have a very high “batting average”; but it is
interesting to note that Table 11 favors the French test to about the same 
extent as the reliability coefficients of Tables 6 and 7 do. This is a striking 
example of internal or analytical evidence confirming total correlation 
results, and illustrates the importance of and the need for meticulous 
care in validating each and every individual item in a test. Only by 
such care can we construct tests such that each item and each moment 
of examination time will contribute to the purpose of the examination 
in terms of real and effective differentiation between the various levels 
of achievement from lowest to highest. 

Analysis of some “bad” questions. — Column 5 shows the order of 
goodness of all the items in each test. The item in each test which dif- 
ferentiates between the poorest and the best students most adequately is 
numbered 1 in column 5; and the item in each test which most tended to
give advanced students low scores and poor students high scores is num- 
bered 220 in the French and 225 in the Spanish test. (See pp. 49-85.) 

The worst of all the items in the two tests is Spanish vocabulary item 
No. 95, tender, to stretch. The excess of poor students over good students 




who answered this item correctly is 23%. The reader needs only to look 
at this item in order to convince himself that it is a very faulty question: 
but it is very doubtful whether any Spanish scholar would have picked it 
as a bad one in a casual, or even in a critical, inspection of the test. This 
item, and several others like it, escaped the vigilance of several scholars 
who were kind enough to review our tests critically. At least two of the 
alternative English words contain illegitimate seductive qualities: tender 
is spelled exactly like the Spanish word, tender; and hold is the equivalent 
of the Spanish word tener, for which tender might be easily mistaken by 
the good student who is working at top speed. 

The next few worst items in the Spanish test, vocabulary items 24, 47, 
91, 19, 52, 63, and 100, all contain unnecessary and questionable seductive 
elements in the English alternatives. The worst items in the French 
test are in Part I, Nos. 40, 94, 85, 91, and 68. The faults of the French 
items are less obvious than those of the Spanish vocabulary items; the 
reader may verify this judgment by comparing the French and Spanish 
items in detail. 

Graphs of “good” and “bad” questions. — The method described above
for detecting good and bad test elements is only approximately adequate; 
we have used it and presented the results of such use only as a rough 
means of showing concretely the need for studying individual items care- 
fully in constructing a test which shall be valid and reliable. In order 
to be really good, a test question must not only show a difference of cor- 
rect responses in favor of the highest quarter in comparison with the 
lowest quarter of the students, but must give progressively greater per- 
centages of correct answers as we ascend the scale of achievement, and 
the increments in correct responses should be fairly regular and nearly 
proportional. In Chart 14 are shown graphs of the increments in cor- 
rect responses from the first, second, third, and highest quarters, for the 
best question in each test. 

The first increment curve, labelled (a), is for French vocabulary item 
No. 60; it is about the most ideal question that it has been my fortune 
to deal with. About 7% of the lowest quarter of the students answer 
it correctly; the per cents then go up to 32% for the second quarter, 66% 
for the third, and 93% for the highest quarter. Curve (e) shows that 
item No. 4 in Part II of the Spanish test is also very nearly ideal. Curves 
(c) and (f) show practically no increment of correct responses from the 
first to the second quarter, or from the 8A to the 8B semester. This is 
quite characteristic of many Part III items in both tests; students in 
junior high school do not apparently begin to learn to write in a foreign 
language until after the second semester.

Unfortunately, the increment curves of Chart 14 are not quite typical 
for our tests; but, as may be surmised from Table 11, the typical curves 




for our test items are much nearer to these than they are to those shown 
in Chart 15. In the latter chart we show the increment, or rather de- 
crement, curves for the worst item in each Part of each test. 

The first curve in Chart 15 shows that French vocabulary item No. 94 
is answered correctly by about 34% of the lowest quarter, 42% of the second 
quarter, 40% of the third quarter, and by only about 26% of the highest 


[Chart 14. Six panels, each plotting per cent of correct responses (0 to 100)
against Groups I to IV. French: (a) Part I-60, (b) Part II-13, (c) Part III-7.
Spanish: (d) Part I, item number not legible, (e) Part II-4, (f) Part III-10.]


CHART 14. — Showing for the best question in each Part of each test, the percentage
of correct responses from each quarter of an unselected sample of 400 students. The 
sample includes 100 students from each of the four normal semester groups. Group I is 
made up of the 100 students in the sample who secured the 100 lowest scores; Group II 
of students making the next 100 lowest scores; Group IV includes the students in the 
sample of 400 who made the highest 100 scores. The increment curves when the ab- 
scissae represent the 8A, 8B, 9A, and 9B classes practically coincide with the curves as 
shown here. Curves (a), (b), and (c) relate to French test items; (d), (e), and (f) to 
Spanish test items. 


quarter! The corresponding per cents for 8A, 8B, 9A, and 9B classes are 
39, 40, 59, and 23, respectively. There is no apparent trick element in 
this item, so that it is quite possible that students learned the word dé- 
montrer in the first semester, retained it through the third semester, and 
forgot it, to a certain extent, through disuse during the fourth semester. 
This tendency is still more pronounced in Spanish vocabulary items Nos. 
91 and 95, although in these items there are counter-suggestions to the 
students which may account for their being inverted. However, these 
graphs and remarks are included here merely as suggestions which may 
help in disseminating a more adequate appreciation of both the impor-
tance and the very real difficulties of constructing valid and reliable
modern language tests. Studies of the consistency and economy and 


[Chart 15. Panels plotting per cent of correct responses against Groups I to IV
for the worst item in each Part. French: Part I-94, Part II-1, Part III-53.
Spanish: Part I-95, Part III-63; the Part II panel label is not legible.]


CHART 15. — Graphs illustrating the lack of validity of some of the least effective
questions in the new-type French and Spanish tests.


articulation of modern language courses can be put on a sound basis only 
when it is possible to administer objective, reliable, and comparable tests 
to the same students at various stages of their progress in modern language 
work. 
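The stricter requirement stated above, that a good question yield progressively greater per cents of correct answers from quarter to quarter, can be sketched as a simple check. The three labels and the function name are hypothetical conveniences, not terms used in the study; the two sets of per cents are those reported above for French vocabulary items No. 60 and No. 94.

```python
def classify(quarter_pcts):
    """Label an item from its per cents correct in Groups I, II, III, IV.

    "inverted":    Group IV does worse than Group I.
    "progressive": every quarter does better than the one below it.
    "irregular":   Group IV is not below Group I, but the rise is not steady.
    """
    steps = [b - a for a, b in zip(quarter_pcts, quarter_pcts[1:])]
    overall = quarter_pcts[-1] - quarter_pcts[0]
    if overall < 0:
        return "inverted"
    if all(step > 0 for step in steps):
        return "progressive"
    return "irregular"


print(classify([7, 32, 66, 93]))   # French vocabulary item No. 60: progressive
print(classify([34, 42, 40, 26]))  # French vocabulary item No. 94: inverted
```

The check says nothing about whether the increments are nearly proportional; that would need a further test on the sizes of the steps.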


V 
SUMMARY AND RECOMMENDATIONS 


Importance and difficulty of constructing good tests. — In immediately 
preceding paragraphs we have compared the percentages of correct re- 
sponses of several groups of different students, and not of the same stu- 
dents at different times. Real knowledge about the way in which our 
students learn, or fail to learn, a modern language as presented in high 
school and college courses will be achieved only when we have traced the 
specific modern language achievement of the same individual students 
through several years of effort by means of comprehensive and comparable 
tests frequently administered. This has not been possible thus far be- 
cause of the subjectivity and unreliability of the traditional forms of 
examinations. A large part of the weakness of the old-type examinations 
is inherent in their form; but some of it at least is due to the casualness 
with which they are constructed. The usual run of old-type modern lan- 
guage examinations are constructed by a single teacher, or by a com- 
mittee of two or three or four teachers, in one or two hours. The French 
and the Spanish tests with which we are concerned in this report are, on 
the one hand, a part of the results of four years of continuous experiment- 
ing with various forms of questions and of analytical work with French 
and Spanish textbooks; and, on the other, are the specific result of two 
months of work during which preliminary lists of questions were admin- 
istered to several thousand junior and senior high school students. The 
prompt casualness with which old-type examinations are made denotes 
either a lack of appreciation of the importance of good measurements, or 
of the real difficulties of making passably reliable examinations, or both. 

The essential purpose of examinations. — One of the stumbling blocks 
in the way of progress in examination making has been the widely ac- 
cepted idea that students should be examined or tested only upon those 
particular and specific parts of the subject matter upon which the teacher 
had instructed them, or thought he had instructed them. It has been 
thought that any other kind of examination would be unfair, and that 
theoretically it should be possible for any student who had done his work 
acceptably to make a mark of 100% on every examination given to him. 
There is, in fact, no objection to the teacher’s amiable desire to know 
how much of his or her particular offerings the students have absorbed; 
but the administration of the school, the student, the parents, and society 

at large have a right to know how well a student at the end of a given
course can, e.g., read French prose, regardless of whether he learned it 
from the teacher of that course, or in spite of the teacher. In other 
words, the interests of all concerned demand examinations which measure 
relative achievement in the subject matter, and not relative achievement 
in the particular compote which students have “had” from particular 
teachers. It is not a question of the fairness of the examination to 
the student, but of getting at educationally significant truth and expressing 
that truth in understandable and usable terms. Nothing could be more 
unfair to students than giving marks which are almost certain to be 
misinterpreted and misused, which is what happens very often when 
marks are assigned on the basis of individualistic or local standards.

Criteria of good examinations. — The fundamentals of good examina-
tions are validity, reliability, comparability and administrative feasibility.
These have been constantly kept in mind at every step in the construction 
of our tests. We have tried to make our tests valid by basing our ques- 
tions on materials which word-counts and analyses of textbooks and of 
syllabi have shown to be important, if not in fact indispensable, common 
essentials; and by making our sampling of the common essentials adequate 
not only as to extent of materials, but also as to variety and depth of 
learning-units tapped. We have tried to make our tests reliable by in- 
sisting on large numbers of relatively independent questions, the answers 
to which involve few irrelevant activities on the part of the student, and 
which can be scored objectively, accurately, and expeditiously. We have 
tried to make our tests yield comparable measurements by making them 
valid and reliable tests adapted to the whole range of achievement in the 
first three or four years of modern language work, and by constructing 
several equivalent Forms at the outset in such a way that additional 
Forms equivalent in difficulty and variability may be made as needed. 
And finally, we have tried to make our tests administratively feasible by 
putting the questions in such a way that the tests are largely self-adminis- 
tered, and by arranging spaces for the students’ answers such that the 
scoring is not only objective, but accurate and expeditious. The real cost 
of these tests when administered to large groups is less than half as much 
as that of old-type tests, which are not half as reliable and valid. 
Although these objective tests require only ninety minutes to administer, 
they include 220 or 225 independently scored items. In the French test 
the vocabulary sampling alone includes 538 different root words. Old- 
type 90-minute tests ordinarily include a vocabulary sampling of less 
than 200 different root words. Disregarding duplicates, we have 100 
words in Part I as a test of passive vocabulary; 477 in Part II, the reading-
comprehension test of 60 carefully graded items; and 80 in Part III as an
active vocabulary test in highly varied contexts. Part III is primarily a 
grammar test of significant and varied content; but in addition to being
this and an active vocabulary test, it is a fairly good test of idioms, order 
of words, agreements, hyphens, elisions, spelling, use of apostrophes, and 
other marks, etc., which are usually not covered in objective tests, and 
only imperfectly in old-type examinations. It is this combination of 
features, in brief, which makes these tests yield comparable and valid 
measures two or three times as reliable and only half as costly as old- 
type examinations of two or three hours’ length yield. 

Some limitations of the new-type tests. — These tests do not pretend 
to measure the oral and aural skills or cultural content. They are de- 
signed to measure only knowledge of the written language, and measure 
other things only in so far as they are implied by, or are dependent upon 
knowledge of vocabulary, reading and grammar. In any case, measure- 
ments of oral and aural skills and cultural content should be kept separate 
from one another and from measures of knowledge of the written language. 
They are three relatively independent functions; it is obvious that stu- 
dents often possess one of them without the other two, and failure or 
success in one should never be confused with, nor reported as, failure or 
success in another. 

Moreover, paper-and-pencil tests of oral and aural skills do not seem 
practicable at the present time. The best way to measure these is by 
means of conversations with students, one at a time, using carefully 
prepared sets of questions and conversational materials. Such measures, 
however, would be subjective, would be mixtures of many relevant and 
irrelevant qualities in both student and teacher, and would not be com- 
parable from place to place and year to year. These facts are not cited 
as arguments against the use of such tests, but as arguments in favor of 
keeping the marks from such tests separate from the marks derived from 
reliable and objective tests. To mix such marks by averaging would be 
about as illuminating as to average reliable measures of height with guessed 
weights of students in order to secure a total physical growth index. 

Some objections considered. — There are few teachers who do not 
admit that the objective tests are better as measuring devices, but some 
teachers fear that the objective tests are pedagogically unsound, and that 
they will tend to mechanize teaching and produce what is called ‘dead 
uniformity.” Specifically, it is feared by some that the objective passive 
vocabulary tests will cause students, aided and abetted by their teachers, 
to “memorize mere lists of words.” Thus it is feared that students might 
make a serious breach in future objective tests by the simple and effective 
device of memorizing the small matter of the two or three thousand most 
frequently used words in each language! The reader can judge for him- 
self whether such a result would be unpedagogical and calamitous, or an 
unhoped-for blessing. It may well be that the objective vocabulary test 
may produce such a miracle; but the proponents of new-type tests have
never been optimistic enough to hope that this charge against their tests 
would turn out to be true. 

It is also difficult to see how these tests would tend to mechanize teach- 
ing. Indeed, it is a question whether the great weight put upon transla- 
tion in the old-type examinations does or does not make a boomerang of 
this charge against the new-type tests. The translation work in modern 
language classes, after many years of trial, is certainly not above suspicion; 
many eminent authorities have found fault with it, and recent investiga-
tions tend to confirm their views.¹ On the other hand, there is no evi-
dence against the objective tests: all the charges made against them seem
to be bold prophecies spun out of conservative fears. There are not 
even any a priori grounds for imagining that the kind of objective tests
here proposed will exercise unpedagogical influence. Good teachers will 
use the best available pedagogical devices, regardless of the merits of 
such devices for testing purposes; and so examiners should use the best 
available measuring devices, regardless of what their merits as teaching 
devices may be. Lack of measuring value does not destroy the peda- 
gogical utility of a device; nor does lack of pedagogical value in a good 
measuring device destroy its value for examination purposes. But we 
do not need to rely on this distinction. On the contrary, it seems almost 
certain that the new-type tests will be as great a boon to teachers as to 
examiners. If new-type tests of the kind here proposed were a regular 
part of the examinations used in our schools, both teachers and students 
would know that the results of their efforts would be accurately and fairly 
measured, without the possibility of personal bias being exercised against 
them or invoked in their behalf; and both teachers and students would 
know that the whole test could not be passed with vocabulary alone, 
unsupported by knowledge of grammar, idioms, reading ability, etc. It 
is obvious that the use of objective tests would not necessarily preclude 
the judicious use of other types of tests for the measurement of modern 
language factors which are not taken care of by new-type forms of questions. 

It is also charged that the objective tests do not measure the spiritual 
and cultural gains of the students from modern foreign language study. 
In so far as these gains are related to or dependent upon knowledge of the 
language itself, the objective tests measure them better than the old- 
type examinations do, for the reason that the former give fuller and more 
accurate measures of language-proficiency than the latter do. In so far 
as these gains are not related to or dependent upon knowledge of the 
modern language itself, neither type of test should be influenced by such 
gains. This does not mean that teachers should not judge the moral and
spiritual characters of their students, if they feel competent to do so;
it does mean that such judgments should not be called French, or Spanish,
or German, etc. That the objective tests do not mix moral appraisals
with measures of defined achievement is one of their greatest merits; it
is certainly not a defect to avoid confusion and obscurity in our school
grades.

¹ The researches of Briggs and Miller, Woodring, Leonard, and others, are all in essential agreement
in indicating the deleterious effects of translation exercises on English composition.

“Dead uniformity” is another specter which some critics pretend to 
see in the wake of objective and reliable measurements. Some of the 
arguments of those who fear dead uniformity sound dangerously like 
pleas for examinations which will be sure not to give accurate or meaning- 
ful measures — as if progress depended upon ignorance of the facts, and 
as if accurate knowledge of defined achievement would necessarily fasten 
a uniform mold upon everything pertaining to modern language work. 
It is an ironical paradox that the charge of dead uniformity should be
made against a type of examination the main purpose of which is to 
enable teachers to provide for individual differences, and which is the 
first and only device thus far proposed that seems capable of enabling 
the schools to make adequate and continuous adjustments to the indi- 
vidual needs of students. Accurate and comparable measures do not lead 
to uniformity, but provide a means of escape from undesirable uniformity. 
The development of means for comparable measurements of body tem- 
perature has not led medical doctors to make uniform prescriptions for 
their patients; on the contrary, it has enabled them to vary their pre- 
scriptions more adequately to meet particular individual needs. 

Reconstruction of curriculum dependent upon use of good tests. — 
These critics can hardly desire the same chaos in examinations as now 
obtains in our textbooks and courses of study. Our textbooks are a 
veritable Tower of Babel. Under the present organization of modern 
language instruction a student practically begins a new language each 
time that the textbook is changed, and he forgets a large part of what he 
has already learned. A word-count¹ of sixteen widely used French first-year
books showed a total vocabulary of about 6000 root words, with
only 134 words common to all sixteen books! This is certainly variety, 
but pedagogically it is indefensible. If we should divide these sixteen 
books among sixteen students, and each learned his book perfectly, the 
conversation with all sixteen students present would be limited to 134 
words and their inflected forms! If those who accuse the objective tests 
of entailing dead uniformity mean that they will help to wipe out this 
intolerable chaos, as well as to reduce the equally incredible and vicious 
overlapping of our classes, then we gladly plead guilty. Accurate and
comparable measurements will shortly make possible the objective demonstration
of the advantages of carefully constructed and articulated courses
of study, and will gradually lead to the production of texts in which the
increments of vocabulary and other materials from lesson to lesson and
from semester to semester will be scaled in accordance with experimental
findings.

¹ Wood: A Comparative Study of the Vocabularies of Sixteen French Textbooks, The Modern Language
Journal, February, 1927.
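
The kind of word-count described above can be illustrated with a short Python sketch. It is not the procedure of the study cited; the book names and word lists are invented for illustration, and `vocabulary_overlap` is a hypothetical helper. It reports the two quantities quoted in the text: the total vocabulary across all books, and the words common to every book.

```python
def vocabulary_overlap(books):
    """Return (all distinct words, words found in every book)."""
    vocabularies = [set(words) for words in books.values()]
    total = set().union(*vocabularies)        # every word used by any book
    common = set.intersection(*vocabularies)  # words shared by all books
    return total, common


# Toy word lists, invented for illustration only.
books = {
    "book_a": ["avoir", "être", "maison", "chien"],
    "book_b": ["avoir", "être", "livre", "chat"],
    "book_c": ["avoir", "être", "maison", "école"],
}

total, common = vocabulary_overlap(books)
print(len(total), "distinct words;", len(common), "common to all books")
```

With sixteen real word lists in place of the toy data, the same two set operations would yield figures of the kind reported above.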

Learning students prerequisite to teaching them.— The importance 
of the use of a single adequately constructed examination for all the language 
(not literature) classes, at least to and including the fourth senior high 
school year of modern language work, can hardly be over-emphasized. 
The real purpose of good examinations is to give us accurate and significant 
information about the achievements of our students and about the re- 
sults of our teaching; while we annually spend much time and money 
and energy giving the traditional examinations, we have really not been 
getting accurate information. The variable class standards and the over- 
lappings set forth in the first part of this report show, among other things, 
how much we have neglected real measurement. 

Learning our students is just as important as teaching them; indeed, it 
is prerequisite to good teaching. It has been suggested that if one-fourth 
of the energy now devoted to teaching were devoted to learning our stu- 
dents, the remaining three-fourths would produce ten times the good 
results now achieved, and would eliminate most of the ills that attend our 
present half-blind teaching efforts. No one can estimate the sum total 
of evils which are due, directly and indirectly, to our blind shooting in 
the dark, in terms of misdirected efforts of teachers and students, in 
terms of bad habits and bad attitudes now developed in both teachers 
and students, in terms of the despair and martyrdom of students doomed 
to so-called failure, and the boredom and stultification of unidentified 
bright minds which are kept to the mediocre pace of the average. No 
one knows what prodigious results might be achieved by identifying and 
providing adequately for these bright minds, and by freeing modern 
language classes from the dismal influence of unadapted students and 
directing them into activities for which they are competent. We are now 
in the position of medical doctors who treat patients en masse, without 
troubling to make individual diagnoses or checking up on the course of 
the malady or on the effects of the treatment. The analogy breaks down 
at this point, however; because no doctor boasts of the high mortality 
rates in his practice, but many teachers do just this under the guise of 
maintaining high standards. 

The trial-period in educational guidance. — The lack of comparability 
in the measurements afforded by the traditional examinations has kept 
us in the dark concerning many modern language teaching and administra- 
tive problems. What students should be permitted and encouraged to 
study modern foreign languages? What is the least number of years
that a student may “take” a modern language with profit? The present
survey of achievement in junior high schools gives us a partial answer to 
this last question, and at the same time points the way to a complete 
answer. The partial answer is that all general policies must submit to 
the ordinances of individual differences, because we have shown that 
some students learn more in one semester than many students learn in 
four semesters. The way indicated for a complete answer is a combina- 
tion of good tests of general intelligence used as prognostic devices and 
trial-periods in modern language classes in which the amount and rate 
of progress of each individual are ascertained at least once each month, 
with a view to making such trial periods as short and as effective as possible. 

There is nothing new in the trial-period idea. It has been used to a 
certain extent in the junior high schools of New York City and elsewhere. 
But it could not be used effectively so long as separate examinations were 
given to each class, and so long as uniform examinations were given only 
to the fourth-semester classes. Two years is quite a fraction of a boy’s 
total life span; yet we find some students in the fourth-semester class 
who are below the second-semester average and who are still on trial! 
The essence of the trial-period is brevity; to make trial-periods short we 
must have accurate and comparable measurements of defined achievements. 
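
Once all classes take one comparable examination, the check described above becomes simple arithmetic. The following Python sketch assumes scores on a single common test; the pupil names, scores, and the helper `students_below_lower_class_average` are hypothetical and not drawn from the survey data.

```python
from statistics import mean


def students_below_lower_class_average(scores_by_semester, upper, lower):
    """List students in the upper class scoring below the lower class average."""
    lower_average = mean(scores_by_semester[lower].values())
    return sorted(
        name
        for name, score in scores_by_semester[upper].items()
        if score < lower_average
    )


# Invented scores on one common examination, keyed by semester class.
scores_by_semester = {
    2: {"pupil_a": 40, "pupil_b": 60, "pupil_c": 80},
    4: {"pupil_d": 55, "pupil_e": 90, "pupil_f": 120},
}

flagged = students_below_lower_class_average(scores_by_semester, upper=4, lower=2)
print(flagged)
```

Such a comparison has no meaning when each class sits a separate examination, which is the point made in the text.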

Reasons for opposition to new-type tests. — It seems strange that the 
development of sound measuring devices should have been so long de- 
layed in the modern language field, because it seems to be as easy, if not in 
fact easier, to construct and apply good examinations in modern language 
classes, through the fourth high school or third college year, as in any 
other field of learning. The causes of this delay are complex, and include 
among other things a too great sensitivity to tradition; an exaggeration 
of the importance of “time-serving” as a condition for and a measure of
achievement; an exaggeration of the punitive function of examinations 
and a too careless acceptance of the theory that examinations have a 
miraculous power to create and raise standards; too much emphasis by 
modern language teachers upon individualistic goals and standards of 
achievement, involving a misconception of the nature and purposes of 
standards, and often leading to a violation of the elemental pedagogical 
principle of continuity, of going from the known to the unknown, of building
new learning on a strengthening of the old, of close articulation between
successive lessons and successive courses of study; too much emphasis
in the early years upon literature and what is called “cultural
content,” and too much faith in the reality and measurableness of an 
ectoplasm called “spirit of the language,” with a corresponding neglect 
of the language itself; and finally, not to mention several others, a lack 
of appreciation of certain phases of the psychology of learning. 

Some of those who use “spirit of the language” as a conclusive argument
against new types of tests have been intemperate enough to say that
they were not interested in vocabulary, and have insisted on measuring the
spirit of the language in students who did not yet know its “dry bones,”
i.e., its words. It is under this same theory that students are plunged 
into foreign language masterpieces, in a hasty pursuit of cultural and 
spiritual riches, without troubling about the “crumbs that mere words 
are.” It is because the language itself is culturally of very high worth, 
and in any case prerequisite to advanced literary and cultural courses in 
the foreign languages, that our tests are confined to vocabulary, reading, 
grammar, idioms, etc., and make no pretence of measuring the “higher
and imponderable realms.” If we continue our present practice of trying 
to make unprepared students sense the native beauties of a Goethe, a 
Moliére, a Cervantes, or a Dante in the original tongues when they have 
to look up every tenth word in the dictionary, and must have every other 
idiom or poetic expression explained by the teacher, our students will 
continue to leave their modern language classes with feelings toward the 
great masters as ardent as the feelings of many high school students who 
have unwillingly toiled over Shakespeare, or who have memorized Mil- 
ton’s sonnets as penances for rebelling against class exercises which seemed 
to them supreme folly. The solution is to do frankly what many institu- 
tions are now doing — give foreign literature courses in English transla- 
tion, or make sure that students can read the foreign language with a 
certain minimum facility before they are admitted to literature courses in 
the original tongues. 

This cannot be done by limiting literature classes to students who have 
“had” at least three or four years of preparatory work; this is the old 
time-serving conception of achievement and preparation which was 
born in sin and perpetuated in iniquity. Students should be admitted to 
modern foreign language literature courses when they are able to read the 
language with the requisite minimum facility, whether they have “had”
it one, two, three, or four years. Students who are not minimally pre- 
pared should not be so admitted regardless of how many years of mod- 
ern language work they have “taken.” 

What the minimum of facility in reading is which is prerequisite to 
successful work in foreign literature courses must be determined by experimental
research. Informing research in this direction depends upon
exact and comparable measurements of defined traits, abilities, and skills. 
Such measurements were impossible as long as we had to depend wholly 
upon the subjective and impressionistic old-type examinations, made up 
and scored by teachers who were jealous of their individual aims and 
proud of their unique “standards.” 

There can be no doubt about the necessity of sweeping away, completely
and for all time, the time-serving conception which has thus far
so largely dominated the organization and administration of modern
language instruction. The continuance of this indefensible iniquity would 
be a crime against both teachers and students. It is the direct cause of 
many of the worst evils attending modern language work. It puts a 
premium on stupidity and laziness, and penalizes intelligence and indus- 
try. To give entirely different examinations to each of the four high 
school classes, for example, and to refuse all first- and second-year stu- 
dents admission to the third- and fourth-year examinations, is nothing 
but gratuitously punishing the best of the students in first- and second- 
year classes for not being stupid enough to require four years to surpass 
the average of all fourth-year students. It is precisely the highest 15 
or 20% of first-year students who, in spite of our almost malicious and 
invidious efforts to stultify and misrepresent them, will ultimately con- 
tribute 70% of the successful students in advanced courses. 

Constructive usefulness the only justification for tests and examinations. —
Incredible as it may seem, some teachers take this position,
candidly: “Even if some students do, through good fortune in falling in 
with a good teacher, or through unusual intelligence and application, 
learn more French in one year than average students do in four years in 
high school, they do not deserve fourth-year credit, because they have 
taken it only one year; if they have done their chore early, let them 
mark time; they should not receive full fourth-year credit until they 
have served their full sentence of four years.” It seems difficult to recon- 
cile this position with any acceptable theory of education or of simple 
truth. Since the primary purpose of grades is a truthful and understand- 
able description of defined achievement to the end that the schools may 
wisely guide the pupils’ efforts to ascend the educational ladder, and direct 
their own efforts into the most profitable channels, we should never be 
led into wilful misrepresentation of the capacity or achievement of a 
student by considerations which regard grades as gifts, honors, moral 
judgments, or signs of mythical or hidden spiritual growth. If a student 
can read French no better than the average first-year student, let us say 
that by the grade that we assign him, whether he has “taken” French three
or five years or not at all; and if he knows as much French as the average
fourth-year student, let us indicate that by the grade assigned to him,
whether he has “taken” French only a year, or a month, or not at all.
“Taking” French is not the only way in which it may be profitably learned. 

Teachers everywhere are beginning to appreciate the elemental neces- 
sity for comparability of school grades. They are beginning to realize 
the vanity and futility of maintaining so-called standards which are 
proudly unique and individualistic, but meaningless, because they are 
expressed in unknown terms. The increasing use of objective and reliable
tests is making possible research which is not only genuinely scientific,
but is on a scale large enough to secure conclusions of more than local
significance. The fact that these new types of tests afford measures that 
are comparable will make the conclusions of all research workers and 
teachers at least understandable, if not acceptable, to every competent 
teacher. Teachers are beginning to realize that the grades they assign 
to their students are not merely outlets for their personal ideals, educa- 
tional or otherwise, or personal predilections and aversions; but that the 
only excuse for or purpose of grades is to convey accurate information, 
couched in understandable terms, which may be used by students, parents, 
teachers, and school administrators in the intelligent and constructive 
educational and vocational guidance of our young people. In so far as 
grades fail of this purpose, they are vain and wasteful iniquities, like signs 
at cross-roads which misdirect travelers and guide them into a wilderness 
of confusion, doubt, and despair — blots on the landscape which ought to 
be swept away as public nuisances.

FOREWORD


The Board of Regents of the University of the State of New York wish 
to tender their thanks to the Carnegie Corporation for the opportunity to 
conduct an extended experiment with new-type tests, which the gift of five 
thousand dollars has afforded them. It has given them great satisfaction 
to have their organization used for this purpose of investigation in the 
important field of examination in the schools of the State. 

The results of the experiments no one could predict. The material has 
now come in and has been examined by Professor Ben D. Wood, who has, 
as shown in the following pages, made certain definite interpretations. 
His conclusions have been reviewed by Dr. Avery W. Skinner, Director of 
the Division of Examinations, and Dr. Warren W. Coxe, Chief of the 
Bureau of Educational Measurements, in the State Department of Educa- 
tion, and while they are in agreement with some of them, about others 
they are in considerable doubt. The responsibility, however, is with 
Doctor Wood; and the Board of Regents and the State Department are 
very appreciative of his careful and scientific study, and are much gratified 
that he has succeeded in securing the funds necessary for the publication 
of this volume. It should without doubt form the basis for still further 
investigations and studies in this field. 

FRANK PIERREPONT GRAVES 


President of the University of New York 
and State Commissioner of Education

I


INTRODUCTION 


The State Department of Education of New York has been increasingly 
interested in recent years in the new developments in the field of educa- 
tional measurements. The growth of this interest has been a natural 
evolution, in thorough consonance with the history of the State Depart- 
ment’s examining activities. New York State, through its Regents exami- 
nations, has long been distinguished as the only State in the Union that 
has seriously attempted to supply the schools and colleges with educa- 
tional measurements which are comparable on a state-wide basis over a 
series of years. 

During the last decade the volume of the Regents examinations has 
grown so rapidly that the burden on the State Department has become 
formidable and threatens to become a serious administrative problem, 
aside from budgetary considerations. If examinations are to be scored in 
time to be used for constructive educational guidance of students, time 
and space factors are even more important than budgetary strictures. 
Moreover, recent studies of the Regents examination results, conducted 
by the State Department, or with its active coöperation, have indicated
that there is considerable room for improvement in the Regents examina- 
tions, both in the matter of accuracy of measurements of achievement, 
and in the matter of maintaining uniform and meaningful standards. 

The present experiment was undertaken by the State Department to 
learn what contribution objective forms of examinations might make, 
under actual working conditions and on a large scale, to this double prob- 
lem of increasing the validity and comparability, and decreasing the 
costs of the Regents examinations. In the fall of 1924, Dr. James Sulli- 
van, Assistant Commissioner for Secondary Education, invited the author 
of this report to coöperate in the experiment by supplying suitable tests,
by assisting the State Department in supervising the administration and 
scoring of the tests, and by making a detailed analysis of the results and 
writing a report. 

In collaboration with experts in the several subject matters, the writer 
had been working since 1922 on a series of objective or new-type collegiate 
placement tests, and several of these were nearing completion in 1924. 
It was, therefore, easy to prepare the tests in time to be printed and ad-
The subject-matter specialists whose collaboration made possible the production
of the tests were, for the French test, the late lamented Professor
A. A. Méras and Miss Suzanne Roth; for the Spanish test, Professor
Frank Callcott; for the German test, Professor C. M. Purin; and for the 
Physics test, Professor H. W. Farwell. 

Plan of the experiment.— In June, 1925, the three-hour periods of 
the Regents examinations in French, Spanish, German, and Physics were 
equally divided between the old- and the new-type forms of questions. 
The old-type subjectively scored examinations were given during the first 
ninety minutes, and the objective tests during the second. The princi- 
pals, school department heads, and teachers throughout the state were 
informed in detail of the plan of the experiment about six weeks before 
the examinations were given. Copies of the Directions to Students and 
examples of each form of question in the objective tests in each subject-matter
were sent to all department heads in modern languages and physics,
with the request that all students who were to take the Regents
examinations be as thoroughly acquainted with the objective forms of 
questions as possible. Teachers were urged to drill their students on the 
details of the new-type questions by using objective questions in their 
weekly or bi-monthly quizzes during the remainder of the semester. 
This “fore-exercise” was entirely vindicated, for in the 45,000-odd new-type
tests resulting from this experiment scarcely a dozen were found
in which there was any evidence that the student had misunderstood 
the directions or had been confused by the form of the questions. 

The old-type parts of these examinations were scored in the usual way 
by the teachers in the schools, and the usual tentative marks assigned 
to the students by the schools were based upon these ninety-minute old- 
type parts, just as though they were regular three-hour Regents examina- 
tions. The new-type parts were not scored by the teachers in the schools, 
but by a special staff of clerks under expert supervision in the State 
Department in Albany; but the principals and teachers understood that 
it was planned to use new-type results in the State Department’s reviewing 
of the teachers’ ratings on the old-type parts of the examinations. Thus 
while the teachers knew that the tentative marks assigned by the schools 
were entirely independent of the new-type results, they knew that the 
State Department planned to use new-type results in determining final 
“Regents Grades,” as distinguished from tentative or local school ratings. 
The fact that, on account of unforeseen circumstances, the new-type 
results could not be entered on the Regents records in time to be made 
a partial basis for determining Regents credit, does not in any manner 
detract from the trustworthy character of the experiment. In the first 
place, it was not known until two months after the examinations were 
administered that the new-type results could not be used according to
original plans. In the second place, the failure to use new-type results
in determining Regents credits for individual students has not prevented 
their use in determining, objectively and reliably, the comparative achieve- 
ments of different schools and of individual teachers in the same school. 
Even if there were a few teachers so constituted intellectually and morally 
as to conspire with their students to defeat the purposes of the experiment, 
the only effect would be to make the new-type examinations seem less 
valid than they really are. It is quite certain, however, that such lapses 
were not frequent enough to vitiate appreciably the scientific validity of 
this experiment. 

The examinations. — The old-type parts of the examinations used in 
this experiment were of exactly the same nature as the usual Regents 
examinations in these subject-matters; the only difference is that they 
were shorter. They were constructed, edited, printed, administered, read, 
and reviewed in the regular way. The old-type parts of the French and 
physics examinations are reproduced below on pages 200 to 207. 

The new-type tests in French, Spanish, and German are parallel in form, 
each consisting of three Parts: 

Part I is a vocabulary test of 100 multiple-choice items. Each item 
consists of a foreign language word, followed by five English words which 
are numbered. The student selects the one of these five English words 
which most nearly corresponds in meaning to the foreign language word, 
and puts its number in parentheses at the right of the item. The score 
is the number of correct identifications which the student makes. 

Part II is a reading-comprehension test of 75 true-false items, carefully 
and experimentally graded in difficulty. Each item is a statement in the 
foreign language of an obvious truth or of an obvious fallacy. The 
truth or the falsity of each statement is easily within the knowledge of 
any high school student intelligent enough to study a foreign language, 
so that it is a test of ability to read the foreign language and not a test 
of special knowledge. The student indicates his understanding of the 
statements by putting a plus sign in the parentheses at the right of true 
statements, and zero at the right of false statements. The score is the 
number of correct responses diminished by the number of wrong responses. 
This method of scoring is used in Part II to overcome the gross effects 
of guessing on the part of intellectually incautious students. Since there 
are only two possible answers, true or false, students unable to read a 
single one of the statements might toss a coin and secure 50% correct 
answers on the average. Thus, students who mark fifty statements by 
pure guess would on the average mark twenty-five correctly, and twenty- 
five wrongly: the scores of these students, on the average, would be 
25 minus 25, or zero. 

Part III is a 100-item completion test of grammar, idioms, spelling,
word-order, capitalization, accents, etc., and of active vocabulary. Each
item consists of a short English sentence or phrase, followed by a foreign 
language translation in which one, two, or three words are missing. The 
student writes the word or words necessary to complete the translation 
on the dotted line at the right of the page. The score is the number of 
items perfectly completed. 
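
The three scoring rules just described can be sketched in Python. This is an illustration under stated assumptions, not the clerical procedure actually used: the function names are hypothetical, the sample key is invented, and the treatment of an unmarked true-false item (recorded as None and counted neither right nor wrong) is an assumption consistent with, but not stated in, the description of Part II.

```python
def count_correct(responses, key):
    """Parts I and III: one point for each response matching the key exactly."""
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def rights_minus_wrongs(responses, key):
    """Part II: correct responses diminished by wrong responses.

    Assumption: an unmarked item is recorded as None and counts neither way.
    """
    right = sum(1 for g, c in zip(responses, key) if g is not None and g == c)
    wrong = sum(1 for g, c in zip(responses, key) if g is not None and g != c)
    return right - wrong


def expected_guessing_score(items_marked, chance_of_right=0.5):
    """Average rights-minus-wrongs score when every marked item is a pure guess."""
    expected_right = items_marked * chance_of_right
    expected_wrong = items_marked * (1 - chance_of_right)
    return expected_right - expected_wrong


# Invented four-item key and one student's marks ("+" true, "0" false).
key = ["+", "0", "+", "+"]
marks = ["+", "+", None, "+"]
print(rights_minus_wrongs(marks, key))  # two right, one wrong, one unmarked
print(expected_guessing_score(50))      # the fifty-guess case from the text
```

The last line reproduces the arithmetic given for Part II: fifty guesses yield on the average twenty-five right and twenty-five wrong, and so a score of zero.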

Only one new-type examination in each modern language for all year- 
classes in that language; comparability of measures. — The reader should 
notice at the outset one important and fundamental feature of the new- 
type modern language tests used in this experiment, namely, that each 
of them is designed to cover the whole range of achievement in high school 
modern language work. The old-type Regents examinations in each 
language include three separate and independent examinations, one for 
each year-class examined; but in this experiment we have used only one 
ninety-minute new-type examination for all year-classes in each language. 
This is, beyond doubt, the most important and significant feature of the 
experiment, because for the first time in the history of the Regents ex- 
aminations it insures comparability between the measures of achievement 
in all year-classes on a state-wide basis, and it points out a way of escap- 
ing from the old time-serving conception of educational achievement 
which has thus far dominated our examining methods.¹

¹ The importance of comparability in educational measures can not be overestimated; it is not too
much to say that most of the maladjustments in modern language classes have been due directly to the
fact that thus far modern language examinations have not afforded us comparable measures of achievement
which could be used in constructive educational guidance of students. So long as separate examinations,
the inter-relations of which are not known, are used for each year-class, homogeneous classes and
efficient exploitation of individual aptitudes of students are almost impossible.

At first thought it seems impossible that one ninety-minute examination
could be adequate to measure the achievement of students in second-,
third-, and fourth-year modern language classes; but the results of this
experiment show not only that one examination can measure the whole
range of achievement in high school modern language work, but that
one examination can measure such achievement more accurately and more
usefully than three old-type examinations measure it. There are several
reasons why the new-type examinations accomplish this result; these will
be discussed in detail later, but two may be mentioned in passing. In
the first place, the foreign language materials used in them are carefully
selected on the basis of objective data, such as word-counts, inventories
of grammars, etc., and these materials are further checked up by experimental
evidence. Elements which are too easy or too difficult, and those
which do not give positive correlations with known criteria of achievement,
are eliminated. Each Part of the new-type tests is carefully scaled
in difficulty on the basis of actual responses of students, so that the whole
range of achievement is covered by small and regular increments of difficulty.
In the second place, the questions are set up in such a form that
fully ninety per cent of the irrelevant activities of the old-type examina- 
tions are eliminated, thus making possible much larger samplings of foreign 
language materials and of performances of the students than in the old- 
type. Each of the new-type language tests used in this experiment includes
275 separate and independent questions, each objectively scored. Large
samplings of foreign language materials, experimentally validated and
scaled in difficulty, set up in objective question-forms: these are the
features which make for accurate and comparable measurements. The 
new-type tests used in this experiment exist in several equivalent forms, 
and additional equivalent Forms can be produced as needed. Only 
Form A of each of our modern language series was used in this experi- 
ment. These are reproduced in full below on pages 209 to 269. 

The new-type physics test consists of 144 true-false statements, all of 
which have been experimentally verified as to validity (see page 299 
below), and which are distributed over the whole field of high school 
physics in such a manner as to give equitable emphasis to the various 
topics. The score is the number of correct responses diminished by the 
number of incorrect responses. This test is reproduced below on pages 
270 to 282. 
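The rights-minus-wrongs rule just described can be sketched in a few lines. This is only an illustration, not the procedure actually followed by the readers: the function name, the six-item key, and the treatment of omitted items (counted neither right nor wrong) are assumptions made for the example.

```python
def rights_minus_wrongs(responses, key):
    """Score a true-false test as (number right) minus (number wrong).

    Assumption for this sketch: an omitted item is recorded as None and
    counts neither for nor against the student.
    """
    right = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrong = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return right - wrong


# Hypothetical six-item key and one student's responses (None = omitted).
key = [True, False, True, True, False, False]
responses = [True, False, False, None, False, True]

score = rights_minus_wrongs(responses, key)  # 3 right, 2 wrong, 1 omitted
perfect = rights_minus_wrongs([True] * 144, [True] * 144)
```

Under this rule a student who marks every statement by pure guessing can expect a score near zero, since rights and wrongs tend to cancel.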

Scope of experiment. — This experiment involves, in effect, a state-wide 
objective survey of student-achievement in four departments of secondary 
education. In addition to the measurement results obtained in this sur- 
vey, other types of information of considerable significance have been 
made available. Each of the new-type tests included a brief question- 
naire to be answered by the student covering such matters as age, sex, 
high school course, number of years which had been, and number of addi- 
tional years which would be, devoted to modern language (or physics) 
study, plans for further education, probable life work, subject-matter 
preferences, etc. This questionnaire was answered by more than ninety 
per cent of the students. Taken in conjunction with the reliable and 
comparable measures of actual achievement afforded by the tests, these 
questionnaires are capable of revealing facts of great significance for the 
fundamental policies and administration of these departments of instruc- 
tion, in particular, and of secondary education in general. But the main 
purpose of the experiment was to learn what contribution the new-type 
examinations might make to the Regents examination system in terms 
of reliability, validity, and comparability of educational measurements, 
and in terms of budgetary economies and administrative conveniences. 
It is hoped that a complete¹ analysis of the questionnaire results may be
presented in a later report. The present report will confine itself almost
entirely to an analysis of our examination results, and of the examinations
themselves. Table 12 shows the number of examination returns for each
class in each of the four subject-matters on which the following studies
of reliability, validity, comparability, and costs of old- and new-type
examinations are based.

1 A preliminary study of the responses of 5000 students to the questionnaire on the new-type French
test is to be published in an early issue of the Modern Language Journal.


TABLE 12


NUMBERS OF STUDENTS TAKING NEW-TYPE REGENTS EXAMINATIONS IN FRENCH,
SPANISH, GERMAN, AND PHYSICS (EXCLUDING DUPLICATES AND A FEW WHOSE PAPERS
WERE NOT USABLE FOR VARIOUS REASONS).¹


French  II                  13,486
        III                  6,741
        IV                     489
Total French                20,716
Spanish II                   5,340
        III                  2,467
        IV                     226
Total Spanish                8,033
German  II                   1,482
        III                    667
        IV                     127
Total German                 2,276
Total Modern Language       31,025
Total Physics               14,081
Grand Total                 45,106


1 Space limitations do not permit us to discuss several interesting features of Table 12; but we cannot 
forego the opportunity of calling attention to one indication of supreme importance to the whole theory of 
offering modern languages in secondary schools, namely, the relative sizes of the 2nd-, 3rd-, and 4th-year 
classes. Such large numbers of students in 1st- and 2nd-year classes can be justified only by the spiritual 
and intangible gains of students, because they certainly do not learn the language. 


II 
RELIABILITY AND VALIDITY OF THE EXAMINATIONS 


RELIABILITY OF OLD- AND NEW-TYPE EXAMINATIONS


By reliability of a test is meant the degree of its consistency, or of its 
self-agreement in measuring the same fact or series of facts two or more 
times. The most rigorous way in which to determine the degree to 
which a given form of examination agrees with itself is to administer two 
equivalent examinations of the given form to the same students, and 
calculate the correlation between the two sets of scores. It is not often 
feasible, however, to subject students to two full examinations of a given 
type; hence the most usual way of estimating the reliability of a test is 
to ‘‘split” it into random halves, and treat the scores on these random 
halves just as though they were derived from two independent and sepa- 
rately administered examinations. By calculating the correlation be- 
tween the two sets of random-half scores, we learn the reliability of one- 
half of the examination in question. The statistical symbol for this
reliability coefficient of one-half of a test is $r_{\frac12\frac12}$, and is read "r-one-half-
one-half." The reliability of the whole examination, $r_{11}$ (r-one-one), can
be very closely estimated from $r_{\frac12\frac12}$ by means of the Spearman-Brown formula,

$$r_{11} = \frac{2\,r_{\frac12\frac12}}{1 + r_{\frac12\frac12}}.$$

There is some evidence that this formula overestimates the reliability of
old-type subjective examinations, and underestimates¹ that of some of the
new-type objective forms of tests; but the total margin of possible error
in our use of this formula is probably very small in relation to the large
differences between the estimated reliability coefficients of the old- and
new-type examinations with which we are concerned.

1 See, e.g., Wood: "Studies of Achievement Tests," Journal of Educational Psychology, January, Febru-
ary and April, 1926; "Measurement of Law School Work, I and II," Columbia Law Review, March, 1924,
and March, 1925; "New-Type Examinations in the College of Physicians and Surgeons," Journal of Per-
sonnel Research, October and November, 1926.

Reliability of old-type examinations. In order to make our estimate
of the reliability of the old-type as favorable as possible to the old-type,
only papers from the largest half-dozen schools (in terms of modern
language enrolment) in the state were used for statistical study. The
reading of the old-type papers in the large schools is usually done by con-
sultation of several teachers, under the immediate supervision of department
heads of many years' experience in teaching and in reading Regents
examinations; it is therefore almost certainly better than average reading. 
No other basis of selection was used; the schools having been selected, all 
the papers from them, both passing and failing, were used. The varia- 
bility of scores in the whole group of papers from these schools turned 
out to be only negligibly different from that of all the schools in the state; 
the only noticeable difference was in the fourth-year classes. It should 
be noted that only the papers ‘claimed,’ i.e., passed by the schools 
(or very near passing in some cases), were reviewed by the State Depart- 
ment in Albany; the papers of students failed by the schools came to us 
direct from the schools. Review ratings were used for all questions in 
all papers which were so rated; that is, our study of the reliability of the 
old-type is a study of the end-product of the Regents system, not of 
school ratings. School ratings were used only when they were accepted, 
or not changed by the State Department reviewers. 

For calculating reliability the old-type papers were ‘“‘split” in such a 
way as to include exactly fifty per cent of the credits in each half, and at 
the same time to include each kind of question, in so far as possible, in 
each half. For example, the French II paper, consisting of seven ques- 
tions (see page 200, below), was divided as follows: 


First half:
  Q. 1, translate 12 lines of French into English . . . . .   25 credits
  Q. 3, write tenses of three of six given verbs  . . . . .   15    "
  Q. 6, translate into French five of seven English phrases   10    "
                                                              50 credits
Second half:
  Q. 2, translate six lines of English into French  . . . .   20 credits
  Q. 4, complete five of seven French phrases . . . . . . .    5    "
  Q. 5, insert verb in five French phrases  . . . . . . . .   [illegible]
  Q. 7, write a composition in French, about 75 words . . .   [illegible]
                                                              50 credits


(NOTE: Credit for oral work may be substituted for Question 7)


Each of these halves thus represents a fairly typical forty-five-minute old- 
type modern language examination, just as all seven of the questions 
represent a fairly typical ninety-minute examination. The second-, third-, 
and fourth-year examinations in French and in Spanish were divided in 
this same manner, and correlations were calculated with the results shown 
in Table 13. In view of the large number of cases involved, the coeffi- 
cients for the second-year papers may be accepted with confidence as 
fairly representative of the highest expectancy of reliability in old-type
modern language examinations. These results are in agreement with all
other available evidence, that the reliability of a ninety-minute old-type 
examination is not above 0.80. The results for the fourth-year papers 
should be interpreted in the light of the small number of cases involved. 
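The split-half procedure and the Spearman-Brown step-up described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the five-student, four-item score matrix are invented for the example, and only the two half-test correlations 0.649 (French II) and 0.262 (French IV) are taken from Table 13.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate whole-test reliability from the half-test correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one list of per-item scores for each student.

    Items 1, 3, 5, ... form one half and items 2, 4, 6, ... the other.
    """
    odd_half = [sum(s[0::2]) for s in item_scores]
    even_half = [sum(s[1::2]) for s in item_scores]
    r_half = pearson_r(odd_half, even_half)
    return r_half, spearman_brown(r_half)


# Hypothetical item scores (1 = right, 0 = wrong) for five students.
students = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]]
r_half, r_whole = split_half_reliability(students)

# Half-test correlations reported in Table 13, stepped up to the whole test.
french_ii = spearman_brown(0.649)
french_iv = spearman_brown(0.262)
```

Stepping up 0.649 gives a value within about a point in the third decimal of the 0.788 reported for French II, and stepping up 0.262 gives about 0.415, as reported for French IV.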


TABLE 13


RELIABILITY COEFFICIENTS OF OLD-TYPE 90-MINUTE REGENTS EXAMINATIONS IN
FRENCH AND SPANISH OF JUNE, 1925. THE FIRST COLUMN, HEADED $r_{\frac12\frac12}$, SHOWS THE
CORRELATIONS BETWEEN RANDOM HALVES OF THE OLD-TYPE TESTS; THE NEXT COLUMN
SHOWS THE RELIABILITY COEFFICIENTS OF THE WHOLE TESTS AS ESTIMATED BY MEANS
OF THE SPEARMAN-BROWN FORMULA. THE THIRD COLUMN SHOWS THE NUMBER OF
CASES ON WHICH THE CORRELATIONS ARE BASED. THESE COEFFICIENTS ARE BASED
ON REVIEW RATINGS OF PAPERS FROM FIVE OF THE LARGEST SCHOOLS IN THE STATE.¹


                $r_{\frac12\frac12}$     S-B $r_{11}$       N
French  II          .649          .788        1,105
        III         .584          .738          867
        IV          .262          .415           85
Spanish II          .565          .722        1,016
        III         .538          .700          629
        IV          .532          .695           95


Reliability of new-type examinations. — Table 13 should be compared 
with Table 14, which shows similarly derived reliability coefficients for the 
new-type tests. The new-type tests were split by purely random divi- 
sions, the first half of each test being made up of the odd-numbered items, 
and the second of the even-numbered items. Papers were chosen from 
schools of all sizes and classes, so that the reliability indications of Table 14 
represent what we might expect of the new-type tests under average condi- 
tions. The groups of students represented in Table 14 are representative 
of their respective classes as to average achievement and variability of 
scores. 

Table 14 shows the reliability coefficients of each Part of each modern 
language test as well as of each whole ninety-minute new-type test. 

1 It was not found possible, within our time-limits, to secure a sufficient number of cases of the old-
type German examinations to calculate correlations. Such data as we have lead us to believe that
German examination reliabilities are not very different from those of French and Spanish examina-
tions. We know from other studies that the reliability of a 90-minute old-type Physics examination is
about 0.70, and certainly not above 0.75. (See Wood, Measurement in Higher Education, World Book
Co., Yonkers, N. Y., 1923. Cf. esp. ch. 9.)

In order to compare the reliabilities of old- and new-type tests of equal
time-allowances, therefore, we must confine our attention to the last three
columns but one in Table 14. The highest Spearman-Brown reliability
coefficient for an old-type
examination is that for French II, 0.788. The highest Spearman-Brown
coefficient for a new-type examination based on returns from a single 
year-class is 0.955 (French II); the average of new-type reliabilities based 
on single year-class samplings is 0.94. This is about 0.15 higher than the 
highest old-type reliability. The Spearman-Brown reliability of each of 
the new-type modern language tests, when based upon second-, third-, 
and fourth-year classes taken together, is above 0.96. 

The reliability coefficients for Part III of the new-type tests are all 
higher than the highest old-type reliability. The lowest Part III coeffi- 
cient is that for French III, 0.851. This means that the lowest reliability 
indication we have for a forty-five-minute new-type test is about 0.06 
higher than the highest indication we have obtained for an old-type 
ninety-minute examination. This is also very nearly true of the twenty- 
five-minute new-type vocabulary test. If we compare old- and new-type 
reliabilities class by class we find that, without exception, the reliabilities 
of the twenty-minute true-false reading tests are higher than those of the 
ninety-minute old-type examinations! 

Reliability coefficients of alienation. — These differences are so large 
that comment is unnecessary; but those who are not familiar with the 
mathematics of correlation coefficients are very likely to underestimate 
the real magnitude of the differences. The very natural assumption 
which non-technical readers will unconsciously make is that correlation 
coefficients are like any other measures such as of length or weight, and 
indicate merely numbers of equal units or of equal amounts of relation- 
ship. Unfortunately, this is not the case. The numerical difference be- 
tween coefficients of 0.50 and 0.60 is 0.10, which is equal to the numerical 
difference between coefficients of 0.80 and 0.90; but the latter difference, 
in terms of real correlation, is about two and one-half times as great as 
the former. 

In order to render numerical comparisons valid, we must transmute our 
coefficients of correlation into coefficients of alienation, in which numerical 
differences are always equal to real differences in degrees of relationship
or of correlation. When the ranges of talent or achievement on which 
the original correlations are based are equal, as in the present case, nu- 
merical comparisons of coefficients of alienation are just as valid as in 
the case of height and weight measurements. The coefficient of aliena- 
tion (k in statistical parlance) is the ratio of an error of estimate to the 
error of pure chance; that is, it tells us how nearly, for example, the scores 
on a test approximate a pure chance relationship to the scores on another 
test, or to the facts which it is supposed to measure. Hence a numeri- 
cally high coefficient of alienation means a close approximation to chance, 
and therefore a low correlation; a numerically low coefficient of alienation 
means a high correlation. Using the formula $k = \sqrt{1 - r^2}$, we have
transmuted the Spearman-Brown reliability estimates of whole ninety-minute
tests (Tables 13 and 14) into coefficients of alienation, with the results 
shown in Table 15. 
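The claim that the step from 0.80 to 0.90 is about two and one-half times as great as the step from 0.50 to 0.60 can be checked directly from the alienation formula. The sketch below is illustrative only; the function and variable names are assumptions for the example.

```python
from math import sqrt


def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return sqrt(1.0 - r * r)


# Reduction in k (that is, the real gain in relationship) for two
# numerically equal steps of 0.10 in the correlation coefficient.
gain_low = alienation(0.50) - alienation(0.60)
gain_high = alienation(0.80) - alienation(0.90)

ratio = gain_high / gain_low
print(f"gain 0.50 to 0.60: {gain_low:.3f}")
print(f"gain 0.80 to 0.90: {gain_high:.3f}")
print(f"ratio: {ratio:.2f}")
```

The ratio comes out just under 2.5, in agreement with the figure quoted in the text.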


TABLE 15


RELIABILITY COEFFICIENTS OF ALIENATION OF OLD- AND OF NEW-TYPE REGENTS
EXAMINATIONS BASED ON SPEARMAN-BROWN RELIABILITY ESTIMATES OF TABLES 13
AND 14. THE THIRD ROW OF FIGURES SHOWS THE RATIOS OF OLD-TYPE TO NEW-TYPE
COEFFICIENTS. THE OLD- AND NEW-TYPE TESTS HERE COMPARED ARE ALL 90-MINUTE
TESTS.


                      FRENCH               SPANISH                  GERMAN          PHYSICS
                  II    III    IV      II    III    IV          II    III    IV
Old Type         .616  .674  .910    .692  .714  .720
New Type         .297  .394  .421    .308  .349  .474          .296  .307  .383      .505
Ratio, Old/New   2.07  1.71  2.16    2.24  2.04  [illegible]


The general conclusion from Table 15 is that the new-type tests are 
from one and one-half to two and one-quarter times as reliable as the old- 
type ninety-minute tests. 
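The old-type row and the ratio row of Table 15 can be recomputed from figures given elsewhere in the text. The sketch below is illustrative: it uses three Spearman-Brown estimates from Table 13 (0.788, 0.415, 0.695) and the corresponding new-type coefficients of alienation from Table 15 (.297, .421, .474); the dictionary layout and names are assumptions for the example.

```python
from math import sqrt


def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return sqrt(1.0 - r * r)


# (old-type S-B reliability from Table 13, new-type k from Table 15)
cases = {
    "French II": (0.788, 0.297),
    "French IV": (0.415, 0.421),
    "Spanish IV": (0.695, 0.474),
}

ratios = {}
for name, (r_old, k_new) in cases.items():
    k_old = alienation(r_old)
    ratios[name] = k_old / k_new  # how many times larger the old-type k is
    print(f"{name}: old-type k = {k_old:.3f}, ratio = {ratios[name]:.2f}")
```

The three ratios fall at roughly 2.07, 2.16, and 1.52, which is the range of "one and one-half to two and one-quarter" stated in the general conclusion.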

Standard errors of estimate. — The reader will have noticed in Table 13 
(q.v.) that the reliability coefficients seem to decrease in magnitude al- 
most consistently from second- to fourth-year classes. It is important to 
ascertain whether this decrease in the numerical magnitude of coefficients 
indicates a genuine decrease in reliability, or is merely a false appearance; 
because it involves the question of the adaptability of new-type examina- 
tions to the measurement of achievement in advanced as opposed to ele- 
mentary courses. There are some eminent modern language teachers 
who strongly advocate the use of new-type tests in elementary modern 
language work, but who are unconvinced of their usefulness in advanced 
courses. If the reliability of the new-type tests is found to be lower in 
fourth-year than in second- or third-year classes, the adaptability of 
objective forms of examinations for advanced courses would at least be 
thrown in question; if, however, it is found that the reliability is as great 
in fourth- as in second-year classes, then doubt on this point is removed 
in so far as reliability is concerned, and in so far as fourth-year high school 
work may be considered as advanced. 

As indicated above, correlations expressed as coefficients of alienation 
may be directly compared numerically when the variabilities of the groups 
of students involved are known to be equal, as in the case of Table 15.
In order to compare correlations based on test results from student-
groups of unequal variabilities, they must first be expressed in terms of
the same units, or in terms of units whose relations are known. Since
in the present instance we desire to compare the reliabilities of the same
new-type tests applied to three different year-classes, we may use the
scales of raw score-points of the tests as the common denominators of all
the correlation coefficients for each of our modern language tests. If the 
values of the score-points are equal throughout the whole range of scores, 
this method of comparison is entirely valid. There is considerable evi- 
dence, too complex for brief exposition, indicating that the score-points 
in each of the new-type tests are equal, or very nearly equal, in all ranges 
of scores. 

Referring to Table 14, the reader will notice that the figures in the
columns headed $\sigma_{\frac12}$ decrease from second- to fourth-year classes even more
notably than do the reliability coefficients in the columns headed $r_{\frac12\frac12}$ and
S-B $r_{11}$. Thus the $r_{\frac12\frac12}$'s of Part II of the French test decrease from 0.751
for French II to 0.562 for French IV, while the average standard devia-
tions of the halves of Part II, the $\sigma_{\frac12}$'s, decrease from 8.88 to 4.64! Using
the formula $\sigma_{\frac12\cdot\frac12} = \sigma_{\frac12}\sqrt{1 - r_{\frac12\frac12}^{2}}$, we find that the standard error of estimate
of one random half of Part II from the other half is 5.86 score-points for
French II and only 3.84 score-points for French IV! In other words the
Part II $r_{\frac12\frac12}$ of 0.562 for French IV indicates a degree of reliability about
40% higher than that indicated by the $r_{\frac12\frac12}$ of 0.751 for French II. Similar
calculations have been made with all the other $r_{\frac12\frac12}$'s and $\sigma_{\frac12}$'s in Table 14
with the results shown in Table 16. Like coefficients of alienation in
Table 15, the standard errors of estimate in Table 16 for any one of the mod-
ern language tests may be compared just as ordinary linear measurements.
If we care to make the very reasonable inference that the score-points of 
all three modern language tests are approximately equal, then compari- 
sons may be made between all the figures of Table 16, except those for 
physics. As in the case of the coefficient of alienation, the lower the 
numerical value of the standard error of estimate the higher is the re- 
liability indicated. 
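The Part II calculation quoted above can be reproduced as a check. In this sketch only the four input figures (8.88 and 0.751 for French II, 4.64 and 0.562 for French IV) come from the text; the function name is an assumption for the example.

```python
from math import sqrt


def standard_error_of_estimate(sigma_half, r_half):
    """Standard error of estimating one random half from the other:
    sigma_half * sqrt(1 - r_half^2), in raw score-points."""
    return sigma_half * sqrt(1.0 - r_half ** 2)


# Part II of the French test: (standard deviation of halves, half-test r)
se_french_ii = standard_error_of_estimate(8.88, 0.751)
se_french_iv = standard_error_of_estimate(4.64, 0.562)

print(f"French II: {se_french_ii:.2f} score-points")
print(f"French IV: {se_french_iv:.2f} score-points")
```

Although the fourth-year correlation is numerically the lower of the two, the much smaller spread of fourth-year scores yields the smaller error in score-points, which is the point of the comparison.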

TABLE 16


STANDARD ERRORS OF ESTIMATE OF RANDOM HALVES OF NEW-TYPE REGENTS EXAMI-
NATIONS OF JUNE, 1925, BASED ON $r_{\frac12\frac12}$'S AND STANDARD DEVIATIONS IN TABLE 14,
SHOWING THAT, EXCEPT IN PART III, RELIABILITY TENDS TO INCREASE AS WE GO FROM
SECOND- TO FOURTH-YEAR CLASSES.


STANDARD ERRORS OF ESTIMATE BY THE FORMULA
$\sigma_{\frac12\cdot\frac12} = \sigma_{\frac12}\sqrt{1 - r_{\frac12\frac12}^{2}}$

                   Part I        Part II       Part III      Whole Test
French  II          3.92          5.86          3.70           10.20
        III         3.61          4.03       [illegible]        6.90
        IV          3.68          3.84          4.21            7.92
Total French        3.67          4.90          4.00            7.90
Spanish II          4.40          4.58       [illegible]        7.22
        III         2.91          4.07          4.05            5.96
        IV       [illegible]      3.30          3.18            6.59
Total Spanish       3.70          4.82          3.84            6.98
German  II          3.78       [illegible]      3.88            8.25
        III         2.82          4.43          4.05            7.44
        IV          2.26          3.42          5.00            6.60
Total German        3.53          4.72          4.57            7.62
Physics                                                         8.98


The general conclusion from Table 16 is that the reliability of the new-
type modern language tests tends to be greater for fourth- and third-year
classes than for second-year classes. The only Part of the new-type
examinations which shows any tendency to be less reliable as we go from
the second to the fourth year is Part III. This Part is the grammar-
completion form of test which is so widely used in old-type examinations.
Parts I and II of the new-type examinations (the only Parts not uni-
versally sanctioned by old-type examination practices) are consistently
and progressively more reliable from second- to fourth-year classes. It
is rather curious that the only form of question in the new-type tests
which is regularly used by old-type examination makers, even in ad-
vanced college courses, is the only form of test that seems less reliable
for third- and fourth- than for second-year high school classes. Even 
for Part III, however, the differences are neither consistent nor excessively 
large, so that this indication should be accepted only tentatively. More- 
over, these facts apply only to reliability, and reliability is by no means 
the ultimate criterion in judging the acceptability of given forms of tests 
for given purposes. The constructive indication here, that Parts I and 
II and the wholes of the new-type tests are equally or more reliable for 
fourth- and third-year classes than for second-year classes, is more im- 
portant than the small reverse differences noted for Part III. 

Probable errors of estimate.— Table 17 shows the probable errors of 
estimate of each new-type test. The figures for the three modern lan- 
guage tests are based on returns from second-, third- and fourth-year 
classes taken together. Although these probable errors are satisfactorily 
small, the reader should remember that they apply to halves of the new- 
type tests. For example, the first entry in column 1 tells us that the
probable error of estimate of scores on one 12½-minute French vocabulary
test from scores on a similar 12½-minute vocabulary test is 2.48 score-
points (or words). The probable error of estimate of scores on one
ten-minute true-false French reading test from scores on a similar test is
3.30 points. The probable error of estimate of one-half (45 minutes) of
the whole new-type French test from the other half is 5.33 points.¹
The probable error of estimate of a score on a ninety-minute new-type
test from estimated "true" scores would be about 4.45 score-points.


TABLE 17


PROBABLE ERRORS OF ESTIMATE AND OF PLACEMENT OF RANDOM HALVES OF NEW-
TYPE TESTS BASED ON $r_{\frac12\frac12}$'S AND $\sigma_{\frac12}$'S IN TABLE 14. THE FIRST FOUR COLUMNS SHOW
PROBABLE ERRORS OF ESTIMATING SCORES ON ODD-NUMBERED ELEMENTS FROM SCORES
ON EVEN-NUMBERED ELEMENTS; THE FIFTH COLUMN SHOWS PROBABLE ERRORS OF ESTI-
MATE OF SCORES ON ONE-HALF THE TEST FROM "TRUE" SCORES. THE FIGURES IN
PARENTHESES IN COLUMN 5 SHOW THE PER CENT THAT EACH $\mathrm{P.E.}_{\infty\cdot\frac12}$ IS OF THE DIF-
FERENCE BETWEEN NEW-TYPE HALF-TEST AVERAGE SCORES OF SECOND- AND THIRD-YEAR
STUDENTS.


PROBABLE ERRORS OF ESTIMATE:  $\mathrm{P.E.}_{\frac12\cdot\frac12} = .6745\,\sigma_{\frac12}\sqrt{1 - r_{\frac12\frac12}^{2}}$
PROBABLE ERRORS OF PLACEMENT: $\mathrm{P.E.}_{\infty\cdot\frac12} = .6745\,\sigma_{\frac12}\sqrt{r_{\frac12\frac12} - r_{\frac12\frac12}^{2}}$

                  Probable errors of estimate              Probable errors
                                                           of placement
            Part I    Part II    Part III    Whole Test    Whole Test
French       2.48      3.30       2.70         5.33        3.68 (19.35%)
Spanish      2.50      3.25       2.59         4.71        3.29 (15.65%)
German       2.38      3.18       3.08         5.14        3.34 (19.0%)
Physics                                        6.06        3.97


1 [By taking the S-B estimate of reliability of the whole French test, and using the standard deviation
of total scores of all three classes taken together, the probable error of estimate of scores on the ninety-
minute new-type French test from scores on another similar test is 6.24 score-points.]

Probable errors of placement. The last column of Table 17 shows the
probable error of estimate of "true" scores on one-half of each new-type
test from obtained scores on one-half of each test. From these figures
we learn that the probable error of placement of students on the basis
of one-half the new-type French test is 3.68 score-points, which is 19.35%
of the difference between the average scores of second- and third-year
French classes. In other words, if the French students in New York
high schools were reclassified on the basis of scores on a random half of
the new-type French test, half of the students would be mis-
placed by less than one-fifth of a year's growth in French achievement
(if the difference between second- and third-year average scores may be
called a year's growth), and half would be misplaced by more than one-
fifth of a year's growth. Less than 10% of the students would be mis-
placed by more than one semester, and less than 1% of the students
would be misplaced by a whole year. This error, large as it is, is still
much smaller than that of the present classification of modern language 
students. As we shall see (below, page 157) an average of about 30% of 
second- and third-year students in all three languages are misplaced by a 
whole year or more. The error of placement on the basis of a whole 
ninety-minute new-type test would, of course, be still smaller. 
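The misplacement percentages quoted above follow from the probable error of 19.35% of a year's growth if errors are taken to be normally distributed, which the factor .6745 presupposes. The sketch below is an illustration under that assumption; the function and constant names are invented for the example, and the figure 7.90 is the whole-test standard error for total French in Table 16.

```python
from math import erf, sqrt

PE_FACTOR = 0.6745  # one probable error = 0.6745 standard errors (normal curve)


def prob_error_exceeds(threshold, probable_error):
    """Chance that a normally distributed error exceeds `threshold`
    in absolute size, given its probable error (same units)."""
    sigma = probable_error / PE_FACTOR
    z = threshold / sigma
    return 1.0 - erf(z / sqrt(2.0))


pe_years = 0.1935  # probable error of placement, as a fraction of a year

p_within_pe = 1.0 - prob_error_exceeds(pe_years, pe_years)  # half, by definition
p_semester = prob_error_exceeds(0.5, pe_years)               # more than a semester
p_year = prob_error_exceeds(1.0, pe_years)                   # a whole year or more

# Converting a standard error of estimate into a probable error:
pe_whole_french = PE_FACTOR * 7.90

print(f"misplaced by more than one semester: {p_semester:.1%}")
print(f"misplaced by a whole year:           {p_year:.2%}")
```

On this reckoning the share misplaced by more than a semester is under 10% and the share misplaced by a whole year is well under 1%, and the converted value for total French agrees with the 5.33 points of Table 17.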


VALIDITY OF NEW-TYPE TESTS


Thus far we have been mainly concerned with the reliability of tests, 
that is, their consistency in measuring whatever they measure. Relia- 
bility is an important requisite; for unless a test agrees with itself when 
applied to the same facts two or more times, it cannot be said to measure 
any definite or definable thing or function, good or bad. Other things 
being equal, the most desirable form of test is the one which has the 
highest reliability coefficient. But the most important feature of a 
test is what it measures. We have shown that the new-type tests are 
about twice as reliable as old-type tests of equal time-allowance, and that 
they are at least as reliable for fourth- as for second-year classes; but 
unless the new-type tests measure real modern language achievement as 
well as, or better than, the old-type examinations do, no degree of re- 
liability or consistency in measuring something else would make them 
suitable additions to the present Regents modern language examinations. 
We have two or three ways of arriving at some conclusion as to what 
the new-type tests actually measure, — (a) their correlations with the 
old-type parts of the Regents examinations in the same subject-matters, 
(b) the intercorrelations of the three Parts of the new-type language tests, 
and (c) analysis of the internal constitution of old- and of new-type tests. 
We shall here take up (a) and (b), reserving (c) for a later section of this 
report (pages 198-307). In considering the correlations between old- and 
new-type tests set forth in Table 18, the reader must recall the very low 
reliability coefficients found for the old-type tests and displayed in Table 
13. Whatever the old-type examinations measure, they measure it so 
unreliably and inconsistently that no test, however valid and reliable, 
could correlate with them very highly, except by chance. 

Correlations of new- with old-type examinations. — In view of the large 
numbers of cases involved, the correlations of Table 18 may be accepted 
with considerable confidence. Of the twenty correlations in the table, 
four are based on more than 13,000 cases, eight on more than 5,000 cases,
and twelve on more than 1,000 cases each. The correlations of new-type 
with school ratings, except those for French IV and physics, range from 
0.64 to 0.71. Considering the varying standards of severity of marking 
in different schools in the state, and the lowering effect which this mixing
of standards necessarily has on relationships between the new- and old-
type tests, these correlations are satisfactorily high. That the correlations 
would have been higher if all the papers had been marked by the same 
readers is indicated by the fact that all but two of the correlations of new- 
type with Regents review ratings (column 2) are higher than the corre-
sponding correlations with school ratings (column 1).


TABLE 18


VALIDITY COEFFICIENTS AND STANDARD ERRORS OF ESTIMATE OF NEW-TYPE REGENTS
EXAMINATIONS OF JUNE, 1925. THE CORRELATION OF NEW-TYPE WITH REVIEW RATINGS
IS HIGHER THAN WITH SCHOOL RATINGS IN ALL CASES EXCEPT SPANISH II AND III. THE
STANDARD ERRORS SHOW THAT THE VALIDITIES OF THE NEW-TYPE TESTS IN SPANISH
AND IN GERMAN BECOME GREATER AS WE GO FROM SECOND- TO FOURTH-YEAR CLASSES,
WHILE THE FRENCH TEST VALIDITIES ARE FAIRLY CONSTANT. THERE IS NO INDICATION
ANYWHERE IN OUR DATA THAT THESE OBJECTIVE TESTS BECOME LESS ADEQUATE AS
WE GO UP THE SCALE OF ACHIEVEMENT IN MODERN LANGUAGES. ON THE CONTRARY,
THEY BECOME BOTH MORE RELIABLE AND MORE VALID.


               CORRELATIONS OF NEW-TYPE       STANDARD ERRORS OF ESTIMATE
               WITH SCHOOL AND REVIEW         OF NEW-TYPE SCORES FROM
               RATINGS                        OLD-TYPE
               School        Review           NS            NR               N
               ratings       ratings
French  II      .686          .710            23.0          22.[?]         13,486
        III     .671          .724            20.4          18.9            6,741
        IV   [illegible]   [illegible]        22.3          22.3              489
Spanish II      .699          .551            22.1          25.7            5,340
        III     .670          .654            18.7          19.1            2,467
        IV      .696       [illegible]        14.7          14.5              226
German  II      .669          .719            29.1          27.2            1,482
        III  [illegible]      .756            23.6          22.0              667
        IV      .640          .723            19.5       [illegible]          127
Physics         .556          .582            19.0          18.6           14,081


The review ratings
of Spanish II and III papers give correlations with new-type scores of
0.551 and 0.654, respectively, while the school ratings give corresponding 
correlations of 0.699 and 0.670. We have no satisfactory theory to offer 
in explanation of these differences. The difference for Spanish III is 
small, and may be ignored as due to chance factors; but the difference 
for Spanish II is both large and genuine, for the correlations here are 
each based on over 5,000 cases. Six of the correlations in column 2 are 
greater than 0.70, which is as strong a vindication of the validity of the 
new-type tests as we could well expect from the old-type examinations 
here used as criteria.

New-type tests more valid for 4th- and 3rd- than for 2nd-year classes.
In preceding pages we have shown that the new-type tests are at least 
as reliable for fourth-year classes as for second-year classes. The ques- 
tion of the validity of objective tests for advanced modern language 
work was left to later discussion. The coefficients in columns 1 and 2 
are lower for fourth- than for second-year classes, and thus superficially 
bear out the position taken by a few modern language teachers that 
while objective tests are very useful for measuring achievement in ele- 
mentary courses they are not so well adapted to the measurement of the 
higher aspects of modern language work in advanced courses. But the 
raw correlations here are as deceptive as in Table 14 above. The standard 
errors of estimate in columns 3 and 4 in Table 18 show that the new-type 
test is equally valid in all three French classes, and is consistently and 
progressively more valid as we go from second- to fourth-year classes in 
Spanish and German. This is an important finding, and ought to be 
convincing to the most devoted believers in the capacity of the old-type 
to reveal the higher products of modern language teaching, because the 
superior validity of the new-type for fourth-year classes is here set forth 
in terms of very much closer correlation of new-type with fourth-year old- 
type examinations than with second- and third-year old-type examinations. 
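
The contrast drawn here between raw correlations and standard errors of estimate can be illustrated with a minimal sketch in Python. It assumes the usual relation, standard error = sigma × sqrt(1 − r²) (the formula quoted in the heading of Table 21a), and inverts it to recover the spread of new-type scores implied by each pair of figures. The function name `implied_sigma` is ours, chosen for illustration; the figures are the school-rating columns of Table 18.

```python
import math

def implied_sigma(r, se):
    """Standard deviation implied by a correlation r and a standard error
    of estimate se, assuming se = sigma * sqrt(1 - r**2)."""
    return se / math.sqrt(1.0 - r * r)

# (class, correlation with school ratings, standard error of estimate),
# read from Table 18.
rows = [
    ("Spanish II", 0.699, 22.1),
    ("Spanish IV", 0.696, 14.7),
    ("German II", 0.669, 29.1),
    ("German IV", 0.640, 19.5),
]

for name, r, se in rows:
    sigma = implied_sigma(r, se)
    print(f"{name:11s} r={r:.3f}  S.E.={se:5.1f}  implied sigma={sigma:5.1f}")
```

On these figures the fourth-year groups appear considerably less variable than the second-year groups, which is how equal or slightly lower raw correlations can accompany markedly smaller errors of estimate.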

Correlation of Parts of new-type tests with old-type examinations. — 
Table 19 shows that, in terms of agreement with old-type examination 
results, each of the three Parts of the new-type tests is satisfactorily valid. 
The forty-five-minute grammar-completion test gives higher coefficients 
than the twenty-five-minute vocabulary and twenty-minute reading tests, 
as was to be expected. The most significant indication of Table 19 is that 
Parts I and II become more valid as we go from second- to fourth-year 
classes, while Part III either remains about the same or actually becomes 
less valid in the higher classes than in the lower. The standard errors of 
estimate of Part I and II scores from old-type scores consistently become 
smaller from second- to fourth-year classes, with the single exception of 
the French vocabulary test, while the standard errors of estimate of the 
grammar-completion Part either remain fairly constant or increase. 
These facts, with those presented in preceding pages, constitute a con- 
clusive vindication, in terms of strictly old-type criteria, of all three new- 
type Parts, but particularly of the two most suspected Parts, the vocabu- 
lary and the reading tests. If the new-type is valid for second-year 
classes, which few would doubt even on a priori grounds, it is at the very 
least just as valid for fourth-year high school modern language work. If 
the reputed higher and less tangible values of fourth-year work really 
exist, and are measured by the old-type tests here used as criteria, the 
conclusion is inescapable that the objective tests of vocabulary and reading 
catch these elusive qualities at least as effectively as they measure the 
less dignified achievements in second-year work. The most likely explanation 
is that fourth-year achievement is not different in fundamental 
nature or worth from second-year achievement, and that most or all of 
the greater mystery and intangibleness of the former is a pleasing but 
fictitious importation. Genuine progress in advanced modern language 
work depends quite as much on effective reading ability and knowledge 
of vocabulary as in elementary courses, with the difference that more 
of both is desirable. Those who decry the “mere mechanics” of the 


TABLE 19 


VALIDITY OF PARTS OF REGENTS NEW-TYPE MODERN LANGUAGE TESTS. CORRELATIONS 
BETWEEN REGENTS OLD-TYPE 90-MINUTE EXAMINATIONS AND EACH PART OF 
NEW-TYPE TESTS, WITH STANDARD ERRORS OF ESTIMATE OF SCORES ON EACH NEW-TYPE 
PART FROM SCORES ON OLD-TYPE EXAMINATIONS. 


                         CORRELATIONS                 STANDARD ERRORS OF ESTIMATE
SUBJECTS AND     Part I      Part II    Part III
CLASSES          Vocabulary  Reading    Grammar      Part I   Part II   Part III      N
                 25 min.     20 min.    45 min.

French II        .448        .552       .693          8.3      8.2       7.9        543
French III       .500        .313       .602          7.2      8.5       6.3        475
French IV        .540        .464       .566          8.9      6.8      10.0        444
Spanish II       .684        [?]        .816          8.4      8.5       8.6        492
Spanish III      .568        .[?]94     .724          6.5      6.9       8.8        567
Spanish IV       .484        .480       .740          5.6      5.9       8.2        225
German II        .581        .[?]96     .714         10.8      9.5       [?]        578
German III       .644        .613       .541          8.2      8.1      15.0        513
German IV        .533        .468       .673          5.2      6.6       [?]         83
Averages         .554        .491       .674


language in advanced examinations are very likely guilty of sacrificing 
genuine progress in modern language work to a disembodied goal which, 
probably fortunately, has no real existence. In these new-type tests half 
the examination period has been allotted to vocabulary and reading; the 
indications are that more, not less, time should be given in all our exami- 
nations, particularly for advanced courses, to the humble but indispen- 
sable matter of vocabulary and reading. 
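
The averages printed at the foot of Table 19 can be checked directly from the columns above them. The sketch below, in Python, recomputes the vocabulary and grammar averages; the reading column is left out because not all of its entries are legible in this copy. The variable and function names are illustrative only.

```python
# Correlations with old-type examinations, read from Table 19
# (French II-IV, Spanish II-IV, German II-IV, in that order).
vocabulary = [0.448, 0.500, 0.540, 0.684, 0.568, 0.484, 0.581, 0.644, 0.533]
grammar = [0.693, 0.602, 0.566, 0.816, 0.724, 0.740, 0.714, 0.541, 0.673]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

print(f"vocabulary average: {mean(vocabulary):.3f}")
print(f"grammar average:    {mean(grammar):.3f}")
```

Both results agree with the printed averages of .554 and .674.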

Intercorrelations of new-type Parts. — It is a matter of fundamental 
importance to the theory and practice of modern language examining to 
know the extent of the interdependence of the three supposedly distinct 
aspects of modern language achievement represented by the vocabulary, 
reading, and grammar-completion Parts of the new-type examinations. 
Some teachers have undoubtedly underestimated the importance of range 
of vocabulary as a condition for, and measure of, modern language achievement; 
indeed, deriding the value of “mere vocabulary” is a not infrequent 
indoor sport at some teachers’ meetings. The magnitude of the inter- 
correlations set forth in Table 20 shows how ill-founded is the popular 
disrespect for word-knowledge. 

The indications of Table 20 may be accepted with confidence, since most 
of the correlations are based on several thousand cases each. The gen- 
eral conclusion is that the intercorrelations of the vocabulary, reading, and 
grammar tests are high enough to establish the validity of each of them 
for general measurement purposes, and yet low enough to show that 
each has a distinctive contribution to make to the measurement of total 
achievement. It is of special interest to notice from the standard errors 
of estimate that the interrelations are, on the whole, closer in fourth- 
than in second- and third-year classes. The high correlations of the vo- 
cabulary and reading Parts with an old and tried form of examination 
such as the grammar-completion test are strong evidence of their validity; 
that their correlations are higher in advanced than in elementary high 
school classes is, with the facts presented in preceding pages, practically 
conclusive evidence of the high validity and indispensable values of the 
objective vocabulary and reading tests. 

The reader will have noticed in preceding tables, as well as in Table 20, 
that the errors of estimate of the true-false reading tests have decreased 
more notably and more consistently from second- to fourth-year classes 
than the errors of estimate of any other form of test. This seems to be 
due in part to the greater adaptability of the true-false form of question 
to brighter and more advanced students than to younger and duller stu- 
dents. In other words, the reliability and validity of true-false tests 
seem to depend more on the intelligent, though perhaps unconscious, 
coöperation of the students than in the case of other forms of tests. It 
is quite likely that the multiple choice form of reading test, such as was 
used in the junior high school examinations,¹ is better for Regents purposes 
than the true-false form. The former is much easier to construct 
than the latter, and suffers from fewer restrictions as to content, vocabu- 
lary, and grammar. All things considered, and pending further evidence, 
the writer believes that multiple choice reading tests should be used to 
the exclusion of true-false tests except for experimental purposes. The 
slight tendency of the probable errors of estimate of the grammar-comple- 
tion test to increase from second- to fourth-year is probably due, in part, 
to the variableness of the content of the fourth-year course of modern 
language study. The tendency of some teachers to take their students 
into the vast fields of foreign language literature before they have mas- 
tered the fundamental vocabulary, grammar, and idiom of the language 
itself is too well known to need comment; but many wise teachers do not 


1 See Part I of this volume, particularly pp. 55 to 61 and 73 to 79. 
assume that fourth-year students can do without further extensive instruction 
in grammar. Thus we have some fourth-year students so extensively 
engaged with literary selections that they perforce neglect 
effective exercise in grammar, while other fourth-year students are assiduously 
strengthening and extending their active knowledge of grammar. 
The result is to make grammar test scores less valid as measures 
of total achievement in fourth- than in second- and third-year classes.¹ 

1 Professor Henmon suggests that “The tendency of the probable errors of estimate in grammar to 
increase is probably due to the differences in emphasis on grammar in different classes, at different levels 
of progress and with differing methods. It is very noticeable in scoring different classes that some are 
above the average in grammar but below in both vocabulary and reading. On the other hand, individual 
classes can be found which are above in vocabulary and reading and below in grammar. Pooling the results 
from a large number of schools obscures these differences but produces a great variability, especially 
evident in grammar.” 


III 


SCHOOL DIFFERENCES: VALIDITY OF OLD-TYPE EXAMINATION 
RESULTS IN INDIVIDUAL SCHOOLS AND 
CLASSES OF SCHOOLS 


New-type results used as criteria for measuring validity of old-type 
examination results.— Although we have not yet presented all our data 
tending to establish the validity of the new-type tests, we believe that 
sufficient evidence has been presented to warrant the conclusion that 
they give more valid and more comparable, as well as more reliable, 
measures of total modern language achievement than the old-type tests 
used in this experiment. For convenience of exposition in connection 
with Table 21, we shall assume this conclusion to be correct, in the firm 
belief that when all our evidence has been presented the reader will find 
the conclusion more than justified. The reader is therefore asked hence- 
forth to consider the new-type results as criteria of language achievement, 
and to consider ratings on the old-type examinations from different 
schools and classes of schools and from the State Department reviewers 
as being on trial. Thus we shall tentatively conclude that those schools 
whose ratings agree most closely with new-type scores are more accurate 
and valid in their reading of the old-type examinations than schools 
whose ratings agree less closely with new-type scores. 

In this report we shall call attention to only a few of the more impor- 
tant indications of Table 21, leaving it to the special interests of readers 
to examine other parts of the Table in detail. 

Validity of ratings from individual schools extremely variable. — 
Looking first at column (2) in Table 21, it is apparent that the correla- 
tions between new-type scores and school ratings on old-type examina- 
tions vary for different schools and classes of schools. The correlations 
in the upper half of column (2) (French II classes) vary from 0.599 to 0.875; 
and those in the lower half of column (2) (French III classes) vary from 
0.517 to 0.810. These variations in degree of agreement with new-type 
scores indicate that the accuracy of rating the old-type examinations is 
notably greater in some schools than in others. 

Validity of school ratings increases consistently with size of schools. — 
Still looking at column (2), it appears that the accuracy of reading old- 
type examinations is greater in large schools than in small schools. The 

TABLE 21 


INTERCORRELATIONS BETWEEN NEW-TYPE SCORES, SCHOOL RATINGS AND REVIEW 
RATINGS ON OLD-TYPE EXAMINATIONS IN FRENCH FOR INDIVIDUAL SCHOOLS AND CLASSES 
OF SCHOOLS, SHOWING VARIATIONS IN ACCURACY OF SCORING IN DIFFERENT SCHOOLS, 
AND SHOWING THAT SOME SCHOOLS THAT MOST NEEDED REVIEWING ESCAPED IT. THE 
VARIATIONS IN CORRELATIONS ARE PARTLY ACCOUNTED FOR BY DIFFERENCES IN 
VARIABILITIES OF INDIVIDUAL SCHOOL GROUPS. NT STANDS FOR NEW-TYPE, S FOR SCHOOL 
RATINGS, AND R FOR REVIEW RATINGS. 


French II 

                     CORRELATIONS               SIGMAS                      MEANS
SCHOOL NO.        NT·S   NT·R   S·R       NT       S       R         NT        S        R          N
   (1)             (2)    (3)    (4)      (5)     (6)     (7)        (8)      (9)     (10)      (11)

[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        102
369 (Acad.)       .652   .555   .779     38.7    3.00    2.99       78.6     [?]      64.7        58
501               [?]    [?]    .946     26.1    6.90    6.98      145.0     74.4     74.7        59
[?]               [?]    [?]    [?]      [?]     8.26    8.0       175.0     78.37    78.3        70
[?]               [?]    [?]    [?]      [?]     8.52    8.38      169.6     77.9     78.2       178
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        144
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        335
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        143
919               [?]    [?]    [?]      29.38   9.00    [?]        [?]      [?]      [?]        129
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        132
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]         62
971               .751   .765   .988     23.37   8.74    8.94      181.7     74.98    74.9       241
Average 50+       .747   .744   .961    (25.00)
30-50             [?]    [?]    [?]      [?]     8.11    8.32      147.76    72.99    72.94     1426
15-30             [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]        [?]
1-15              [?]    [?]    [?]      30.9    5.66    8.32      146.09    72.21    71.29     2786
Sr. Schools       .649   .617   .790     31.6    6.9     5.96      137.12    70.89    69.36      358
Academies         .599   .687   .863     33.9    8.12    8.8       144.68    72.47    71.23     1228
Total Fr. II      .686   .710   [?]      31.55   7.96    8.88      150.4     73.54    72.74   13,486

French III 

[?]               [?]    [?]    [?]      [?]     [?]     7.74      196.8     80.97    80.79       61
[?]               [?]    [?]    [?]      [?]     [?]     9.72      192.9     77.44    78.19       68
[?]               [?]    [?]    [?]      [?]     [?]     7.8       196.8     77.0     76.9        48
791               .810   [?]    .969     [?]     [?]     8.88      192.6     77.82    78.4        45
806               [?]    [?]    [?]      [?]     7.98    9.02      187.4     78.06    77.3        66
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]         43
862               .517   [?]    .996     [?]     [?]     [?]        [?]      [?]      [?]        235
[?]               [?]    [?]    [?]      [?]     [?]     [?]        [?]      [?]      [?]         46
[?]               [?]    [?]    [?]      [?]     [?]     7.02      177.6     75.44    74.09       43
[?]               [?]    [?]    [?]      [?]     [?]     9.04      192.5     80.57    80.7        56
[?]               [?]    [?]    [?]      [?]     [?]     8.4       183.0     74.53    74.4        57
[?]               [?]    [?]    [?]      [?]     7.16    8.01      201.1     76.68    76.1       118
984               .680   .689   .994     20.0    8.09    8.06      194.6     78.72    78.8       263
Average 40+       .705   .699   .963    (21.6)
30-40             [?]    [?]    [?]      [?]     [?]     8.39      184.89    74.88    74.77      460
15-30             .702   .698   .948     25.90   8.58    8.57      182.84    75.44    74.46      749
1-15              .652   .653   .900     26.4    7.84    [?]       182.21    74.25    72.28     1589
Sr. Schools       [?]    [?]    [?]      [?]     6.93    6.46      163.79    73.39    69.39       58
Academies         .647   .657   .833     35.64   7.95    7.26      166.35    73.14    70.48      728
Total Fr. III     .671   .724   [?]      [?]     [?]     8.58      183.44    75.67    74.63     6741

French IV 

Total Fr. IV      [?]    .579   [?]      [?]     [?]     9.01      205.84    77.89    [?]        489


schools studied individually in Table 21 were selected so as to be repre- 
sentative of large schools all over the state, from among those having 
fifty or more students in French II, and forty or more in French III. 
The identifications by numbers of the individual schools and classes of 
schools are given in column (1). The correlations for the largest schools 
have been averaged, so that we may compare the schools that have fifty 
or more students in French II or III, as a group, with groups of schools 
having 30-50, 15-30 and 1-15 students, and with senior schools¹ as a 
group, and with the academies (private schools) as a group. The aver- 
age correlation between new-type and school ratings on the French II 
old-type examination for the largest schools is 0.747, and for the academies 
only 0.599. The decrease in correlation as we go from the large schools 
through the smaller schools and senior schools to the academies is con- 
sistent. 

Validity of ratings in such high schools as Morris, Boys, Albany, Utica, 
South Park, etc., nearly twice as great as in academies. — The differences 
between the six size-groups of schools studied in Table 21, as to accuracy 
of old-type examination ratings, are much greater than is indicated by the 
correlation coefficients in column (2). The apparent reversals in the order 
of correlations at the bottom of column (2) also disappear entirely when 
the correlations are interpreted in relation to the standard deviations of 
the groups on which they are based. Table 21a shows these relation- 
ships in terms of comparable units, that is, in terms of standard errors of 
estimate of new-type scores from old-type ratings. 
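
The conversion described here can be sketched in a few lines of Python, assuming the formula quoted in the heading of Table 21a, that is, standard error of estimate = sigma × sqrt(1 − r²). The function name and the selection of rows are ours; the sigmas and correlations are taken from columns (5) and (2) of Table 21 for three groups whose entries are legible.

```python
import math

def standard_error_of_estimate(sigma, r):
    """Standard error of estimating new-type scores from old-type ratings,
    given the new-type standard deviation sigma and the correlation r."""
    return sigma * math.sqrt(1.0 - r * r)

# (group, sigma of new-type scores, correlation with school ratings),
# read from columns (5) and (2) of Table 21.
groups = [
    ("French II, senior schools", 31.6, 0.649),
    ("French II, academies", 33.9, 0.599),
    ("French III, academies", 35.64, 0.647),
]

for name, sigma, r in groups:
    se = standard_error_of_estimate(sigma, r)
    print(f"{name:27s} S.E. = {se:4.1f} score-points")
```

The computed values fall within a fraction of a score-point of the corresponding entries in Table 21a (24.0, 27.0 and 27.1); the small differences presumably arise from rounding.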

From Table 21a it appears that, in terms of agreement with new-type 
scores, the ratings of the old-type examinations in such schools as Morris, 
Boys, Albany, Utica, and South Park High Schools, etc., are 
nearly twice as accurate and significant as they are in the academies and 


1 A senior school is defined by the Handbook of the University of the State of New York as an institution 
that is “registered as affording suitable facilities for maintaining an approved course of three years of 
academic work.” 
senior schools in New York State. This indication, based upon objective 
and unbiased data, is of great significance with regard not only to the 
examinations of the State Department, but to its fundamental educational 
administrative policies. Taken in conjunction with the uniformly lower 
average achievements of the smaller schools and academies, and the 
greater heterogeneity of their classes [Table 21, columns (5) and (8)], 
Table 21a is basic evidence of the need for a centralized examining agency 
which can furnish all the schools of the state with accurate measurements 


TABLE 21a 


STANDARD ERRORS OF ESTIMATE OF NEW-TYPE SCORES FROM OLD-TYPE EXAMINATION 
RATINGS FROM SIX SIZE-GROUPS OF SCHOOLS, CALCULATED FROM THE CORRELATIONS 
AND SIGMAS OF TABLE 21, BY THE FORMULA $\sigma_{N \cdot O} = \sigma_N \sqrt{1 - r_{NO}^2}$. 


SCHOOL GROUPS        FRENCH II     FRENCH III 

50 +                 17.0          15.3 
30-50                22.0          16.7 
15-30                22.2          18.4 
1-15                 22.[?]        20.0 
Senior Schools       24.0          25.4 
Academies            27.0          27.1 


of defined achievements expressed in comparable units. Abandonment 
or weakening of the Regents examinations would be a retrogression to 
chaos so far as accurate and meaningful measurements of school products 
are concerned. 

That the large schools also need a centralized agency is indicated by 
the differences between the correlations for the individual schools in 
column (2). For example, the French III classes of schools 791 and 862 
are almost equal in average achievement and in variability [columns (5) 
and (8)]; but the correlation of old-type school ratings with new-type 
scores for the former is 0.81, and for the latter only 0.517 [column (2)]. 
The accuracy of rating in 791 is about 50 per cent greater than that in 
862, in terms of standard error of estimate. This may be due partly to 
the fact that in 791 there were only 45 students to rate, while there were 
235 in 862. But comparing 791 with 806, which had only 66 students in 
French III, we find a still greater difference! The standard error of 
estimate in 806 is almost twice as great as in 791! 
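
Since the two classes are described as nearly equal in variability, the ratio of their standard errors of estimate depends almost wholly on the two correlations. A short Python sketch, assuming equal sigmas and the relation standard error = sigma × sqrt(1 − r²), shows where the figure of about 50 per cent for schools 791 and 862 comes from. The function name is illustrative.

```python
import math

def se_ratio(r_low, r_high):
    """Ratio of two standard errors of estimate for groups of equal
    variability: sqrt(1 - r_low**2) / sqrt(1 - r_high**2)."""
    return math.sqrt((1.0 - r_low ** 2) / (1.0 - r_high ** 2))

# Correlations of school ratings with new-type scores, French III:
# school 862 (0.517) against school 791 (0.810).
ratio = se_ratio(0.517, 0.810)
print(f"S.E. in 862 is {ratio:.2f} times the S.E. in 791")
```

A ratio of roughly 1.46 corresponds to the "about 50 per cent" stated in the text.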

Validity of Regents review ratings varies as much from school to school 
as validity of school ratings varies and is only slightly greater on the whole 
than validity of school ratings. — The figures in columns (3) and (4) in 
Table 21 show that while the Regents reviewing does slightly increase the 
accuracy of the school ratings on the old-type papers, when the total 
populations of each of the classes are treated as groups, yet the reviewing 
does not nearly raise the accuracy of the marking in the vast majority 
of the schools to the high level of accuracy achieved by such schools as 
those named above. Indeed, the correlations in column (3) between 
new-type and review ratings vary as much as those in column (2). One 
would not expect the reviewing to improve on the correlations of 0.75 
and above, which the best schools give, because these coefficients probably 
represent the maximum validity of which a ninety-minute old-type examination 
is capable; but it is certainly within reason to expect that thorough 
reviewing would raise validity coefficients which are as low as 0.517 to 
0.70, that is, to a figure approximating the average of the validity coeffi- 
cients from the dozen schools in the state which are known to be among 
the most accurate. These large schools show that it is possible to reduce 
the standard error of estimate to an average of about 17 new-type score- 
points for French II, and 15 for French III; but all schools having less 
than 50 students in French II give standard errors of from 22 to 27 new- 
type score-points, and all schools having less than 40 French III students 
give standard errors of from 17 to 27 score-points. These standard errors 
are in no case significantly reduced by the Department reviewing: if for a 
given school, the school ratings give a small standard error (i.e., a high 
correlation) the review ratings will do the same, within a point or two; 
if the school ratings show a large standard error (i.e., a low correlation), 
the review ratings also will show a low correlation with new-type scores. 
Thus, the reviewing turns out to be, in effect, a fairly general acceptance 
of the school ratings. In some cases the review marks give correlations 
with new-type scores actually lower than those which the school ratings 
give: comparison of columns (2) and (3) shows that this happens in about 
a third of the cases in Table 21. In a later section of this report we shall 
show that the reviewing fails also to equalize the varying standards of 
the schools. 

These facts signify a serious and genuine breakdown of the Regents 
examinations. They show that the state examination system is not, in 
fact, giving the schools uniformly and highly accurate educational measure- 
ments nor maintaining uniform standards of achievement for all the 
schools, but is to a large extent simply accepting the ratings of the schools, 
with most of the local variations in standards and accuracy unaltered, 
and is “giving” them the name of Regents Grades. What is the reason 
for this failure to achieve the very ends for which the Regents examinations 
were mainly established, and the accomplishment of which is so 
essential to sound educational administration and to definable progress? 

Reasons for failure of Regents reviewing to correct inaccuracies of school 
ratings on old-type Regents examinations. — It would manifestly be ab- 
surd to charge this to the readers or to the State Administration. The 
readers are all thoroughly trained as readers, have had a minimum of several 
years of teaching experience, and work under the immediate direction 
of the State Supervisor. They do their work devotedly, and take just 
pride in it. The real causes of the failure are not hard to find. 

Not all papers are reviewed. — In the first place, not all of the papers 
are reviewed. The number of readers provided is so small in relation 
to the number of papers to be read that the State Department is com- 
pelled to rely on the “sample method” of reviewing. All the papers 
from a given school must be judged on the basis of a review of ten or 
fifteen per cent of them. 

Some reviewing is misplaced. — The reviewing that does take place is 
frequently misplaced; that is, some papers from a given school that need 
reviewing may escape it, and some that do not need reviewing get it; 
and some schools that need extensive reviewing escape with a “fortunate” 
sampling, and others that do not need it have the time of the reviewers 
taken up needlessly because of an “unfortunate” sampling. This comes 
about because the sample itself is the only guide which the readers have 
in selecting papers and schools for review. The full quota of reviewing 
time may be expended on papers from a given school before the reader 
is enabled to make a reliable judgment as to how complete a review that 
particular school ought to have. The size of the sampling of papers 
reviewed from a given school depends on the quality of the first papers 
that happen to be chosen to make up the sample. As an illustration of 
misplaced reviewing efforts, the case of schools 791 and 862 may be men- 
tioned again. As indicated above (page 132), the French III school ratings 
in the former give a correlation of 0.810 with new-type and in the latter of 
0.517. If the new-type has any reasonable validity, 862 clearly needed 
reviewing more than 791; but the correlations between review and school 
ratings for these two schools are 0.969 and 0.996, respectively, showing 
that the school most in need of review had its ratings accepted practically 
without change. 

Review ratings are not independent. — A third disadvantage under which 
the Department reviewing suffers is that the Regents system does not 
provide for genuinely independent ratings by the Department readers. 
Under the present system the Department reader can hardly avoid seeing 
the marks of the school readers, and it is well known that truly inde- 
pendent ratings are not possible under such conditions. The Depart- 
ment reader, however careful and conscientious, cannot entirely overcome 
the “drag” effect which his knowledge of the school readers’ marks gives 
to the school ratings; the tendency, therefore, is to make changes only 
when the Department reader is fully convinced of the errors in the school 
ratings, which means that many small errors escape correction in papers 
that are reviewed. The dependent character of the Department reviewing 
is a more serious fault than it appears to be, especially in conjunction 
with the other weaknesses here listed. 

Unclaimed papers very rarely reviewed. — Except in rare cases, the 
Department reviewers never see the papers not claimed by the schools. 
In the ordinary routine of the Regents examinations, only papers claimed 
by the schools are sent to Albany for review. If the reviewers become 
convinced that a certain school has marked too severely, a special case 
is made of that school, and it is requested to send up its failed or un- 
claimed papers. This weakness in the Regents system might have been 
mentioned as an aspect of the first defect described above under the head 
of incompleteness; because in effect it means that school ratings on about 
one-fifth of all the papers are automatically accepted by the State De- 
partment, and these ratings become official Regents Grades forever after. 
Thus about two out of three students that are failed on the Regents exami- 
nations are failed on local standards by the schools, without benefit of 
Department reviews! But this aspect of the incompleteness of the 
Department reviewing is mentioned separately because it indicates 
at least a slight survival in the Regents system of the ancient puni- 
tive and unconstructive theory of examinations. The fact that only 
claimed papers are regularly sent up for review is, on its face, a tacit 
reversal of the legal doctrine that innocence must be assumed until guilt 
is proven beyond reasonable doubt. It says, in effect, that the State 
Department assumes that all the schools will “claim” all the papers 
which are entitled to a Regents passing mark, or more, and that no school 
will fail to claim all papers that it ought to claim. Aside from the re- 
flection on the integrity of the school reader, this is a singularly uncon- 
structive practice, both in theory and effect. The State Department 
ought to be more zealous in protecting the educational interests of good 
students than it is in keeping some schools from making a false showing 
of high achievement. The practice of reviewing only claimed papers, 
and that of confining the reviewing of these mainly to those just above the 
passing line established by the schools, assumes too much about local 
school standards which the Regents examinations were established to verify. 

Of course, this is in no way a criticism of State Department officials. 
With the provisions available, the Department could hardly do more or 
better reviewing than it is doing. The present experiment, and others 
like it which the Department has carried out, show that the state educa- 
tional officers are keenly aware of the weaknesses of the Regents system, 
and are seeking constructive and sound remedies as rapidly as possible. 

Fundamental weakness inherent in old-type subjective form of examina- 
tions. — The four weaknesses thus far described have had to do mainly 
with the methods and conditions of the Department reviewing which keep 
the review ratings from being notable improvements on the school ratings 
as to accuracy; but the fundamental weakness of the Regents system, 
from which all the defects described arise, is inherent in the nature of the 
old-type examination. Even the results from the most accurate schools fail 
to approach what might be considered as satisfactory reliability and valid- 
ity coefficients, so that reviews would in no case give entirely satisfactory 
results no matter how extensive they might be. The old-type form of 
examination is subjective in both structure and scoring, it can include 
only very inadequate samplings of materials and performances in exam- 
ination periods of normal length, it is incapable of giving comparable 
measures, particularly between different year-classes, and finally, as we 
shall show in a later section, it is extravagantly costly in comparison 
with other more reliable forms of examinations. The objective or new- 
type forms of tests used in this experiment will go far toward eliminating 
all of these weaknesses. 

This does not mean that old-type forms of questions should be aban- 
doned. On the contrary, our data show that the old-type examinations 
have considerable measurement value, when the papers are carefully and 
completely reviewed by a central agency. But in no case does it appear 
that the old-type is more than a partial measure of total modern language 
achievement. The old-type examination at its best gives no better indications 
than we have secured from the least reliable and least valid of 
the three Parts of the new-type examinations. There is no more reason 
for limiting the Regents modern language examinations to old-type 
questions than for limiting them to true-false reading questions or to 
multiple choice vocabulary questions. The two types supplement each 
other; the answer, therefore, is to use a judicious combination of both 
types. We shall thereby minimize the weaknesses and smooth out the 
defects of both while surrendering the real values of neither of the two 
types of examinations. 

Classes in large schools are more homogeneous and achieve higher 
average scores than in small schools. — A fifth indication in Table 21 
is that the classes are more homogeneous in the larger schools than in the 
smaller. The figures in column (5) show that the standard deviations 
become progressively larger as we go from schools having fifty or more 
students in French II and forty or more in French III, through the smaller 
schools to senior schools and academies. The average standard devia- 
tion for the French II classes in large schools is about twenty-five score- 
points and in academies almost thirty-four points. The corresponding 
standard deviations for French III classes are 21.6 and 35.6 score-points. 
Only two of the twelve schools represented in Table 21 having more than 
fifty students in French II have sigmas above twenty-six points; one of 
these (No. 369) is a private school, with standard deviation of 38.7 points, 
and the other is School 919 with standard deviation of 29.38 points. 
Inspection of column (8) will show also that the average scores become
progressively smaller as we go from large to small schools. Only three
or four of the large schools are below the state-wide averages for French II
and III; and of these few only one, No. 369 (a private school), is more than
five points below average. The average new-type score of the French II
class in School No. 369 is seventy-two points below the state-wide average.

Table 21 thus shows three important differences, which seem to be 
highly interdependent, between the large schools on the one hand, and 
the small schools and private schools on the other: (1) the large schools 
furnish more accurate ratings on the old-type Regents examinations than 
the smaller schools and academies; (2) the classification of students is 
better in the large schools, the classes in some small and private schools 
being so heterogeneous as to make efficient teaching very difficult, if not 
impossible; and (3) the average achievement in large schools is uniformly 
at or above the state-wide average, while that of the smaller and private 
schools is consistently and often considerably below the state-wide aver- 
age. These differences hold for both second- and third-year classes, being 
slightly more pronounced for the latter. 

Proportions of students of French in various classes of schools. — The 
sixth and last fact that will be specifically mentioned in connection with 
Table 21 is derived from column (11), which shows the numbers of stu-
dents from each of five classes of schools that took the Regents examina-
tions in June, 1925. It appears that about 9% of all the French II stu-
dents in the state come from academies, 2.6% from senior schools, and
about 20% from schools having fewer than fifteen students in French II. 
The corresponding per cents for French III are 11%, 0.8%, and 10%. 
The average scores in both French II and III of all three of these groups 
of schools are below the state-wide average, particularly of the senior 
schools and academies. Of the 20,000-odd students taking either the 
French II or French III Regents examinations in June, 1925, nearly 
8000, or nearly 40%, had their French instruction in academies, senior 
schools, or schools having fewer than fifteen students in French II or 
French III. More than a fifth of the French II and III students in New 
York State are in schools in which the teacher of French has fewer than 
fifteen students in the second- or in the third-year class in this language. 
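
The "nearly 40%" figure can be checked by simple arithmetic. The sketch below is an illustration only: the two class totals are the French II and French III populations reported in Table 28, and the count of 8,000 is the rounded "nearly 8000" of the text, not an exact tabulation.

```python
# Total students taking each examination in June, 1925 (totals of Table 28).
FRENCH_II_TOTAL = 13486
FRENCH_III_TOTAL = 6741

# "Nearly 8000" in the text; treated here as a rounded figure (assumption).
SMALL_OR_PRIVATE = 8000

all_french = FRENCH_II_TOTAL + FRENCH_III_TOTAL
share = SMALL_OR_PRIVATE / all_french

print(f"French II and III candidates: {all_french}")
print(f"Share in academies, senior schools, and small classes: {share:.1%}")
```

Because the 8,000 is itself rounded upward, the computed share is a slight overestimate, which agrees with the wording "nearly 40%".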

This discussion of Table 21 applies with only small modifications to 
Tables 22, 23, and 24, which give data on the Spanish, German, and 
physics examinations paralleling those which Table 21 gives for the 
French examinations. The reader will readily understand the slight 
differences between the tables. They will therefore be presented without 
comment, beyond calling the attention of the reader to the fact that the 
Department ratings on the old-type examinations for Spanish II and III 
seem much less accurate, in terms of agreement with new-type results, 
than the school ratings on these examinations. This is strong confirma-
tion of the suggestions already made (pages 133-136) concerning the
Regents reviewing system. 
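
"Agreement with new-type results" is expressed in Tables 21-24 by correlation coefficients. The following is a minimal sketch of how such a coefficient may be computed, assuming the ordinary product-moment (Pearson) formula; the function name and the six sets of scores are hypothetical and are not taken from the tables.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has zero variance")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six students (illustration only, not data from the tables).
new_type = [120, 150, 165, 180, 200, 215]      # NT: new-type total scores
school_rating = [62, 70, 74, 71, 83, 88]       # S: ratings given by the school
review_rating = [60, 69, 75, 72, 82, 86]       # R: ratings after central review

print("NT and S:", round(pearson_r(new_type, school_rating), 3))
print("NT and R:", round(pearson_r(new_type, review_rating), 3))
print("S and R: ", round(pearson_r(school_rating, review_rating), 3))
```

A coefficient computed in a very homogeneous class tends to be lower than one computed in a heterogeneous class, which is the caution attached to the captions of Tables 22-24.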


TABLE 22 


INTERCORRELATIONS BETWEEN NEW-TYPE SCORES, SCHOOL RATINGS AND REVIEW
RATINGS ON OLD-TYPE EXAMINATIONS IN SPANISH FOR INDIVIDUAL SCHOOLS AND CLASSES 
OF SCHOOLS, SHOWING VARIATIONS IN ACCURACY OF SCORING IN DIFFERENT SCHOOLS, 
AND SHOWING THAT SOME SCHOOLS THAT MOST NEEDED REVIEWING ESCAPED IT. THE 
VARIATIONS IN CORRELATIONS ARE PARTLY ACCOUNTED FOR BY DIFFERENCES IN VARIA- 
BILITIES OF INDIVIDUAL SCHOOL GROUPS. NT MEANS NEW-TYPE, S MEANS SCHOOL 
RATINGS, AND R REVIEW RATINGS. 


Spanish II 
                 CORRELATIONS                    SIGMAS                MEANS
SCHOOL NO.   NT and S | NT and R | S and R ||  NT  |  S  |  R  ||  NT  |  S  |  R  |  N

SOT ee RSET iON 9867 32:03) Ole Sc p2. WAS Oh iocomd bale 109 
AGG fern edode a (Worle. 7 9Gnl| 25:0) 8.43 | 7.96 || 162.29 | 80.59 | 72:96 85 
S04 ee Se eOliSa|oss: le o4onlle2oc domes res sy 130:0- W66.7- | 65.55 40 
SN 4 A Wt Cte) SP) Se Mi PALS Ie| 7a 7.76 || 174.25 | 75.5 | 74.03 213 
SIO ea O44 Sorel ISor 24292. oes 8.32 || 156.8 | 72.53 | 72.05 38 
862 >. ss 688) 2799) |-998))\| 2728 8.44 | 8.5 EOS “45. 74) F570 257 
Oa een een OL Pe OSON 225) 2.84 42 || 128.4 | 65.6 | 64.13 47 
C1 Oe | Olllan ie GOLa O02 ateOna 6.04 | 5.88 |! 181.5 | 68.4 | 69.34 89 
Average 40 + | .709 | .633 | .775 ||(28.0) 
30-40 2 6635-620) 9025) 282568 OD ie eso 159.48 | 74.51 | 73.69 268 
15-30 SNS S3) S20 954s iS O4 7.16 | 7.54 || 155.08 | 72.48 | 71.48 658 

1-15 ee OO ln me Or cer tee Sian leagues Cpe P| ay eos) 154.3 | 72.61 | 71.02 609 
Academies . | .687 | .682 | .997 || 30.58 | 7.67 | 7.00 || 153.8 | 71.22 | 69.96 284 
Total Sp. II | .699 | .551 |      || 30.87 | 7.97 | 7.99 || 164.35 | 74.08 | 72.91 | 5340

Spanish III 

696 811 | .510 | .947 || 23.8 8.12 | 8.39 || 174.87 | 72.93 | 70.53 30 
793 676 | .668 | .976 || 22.06 | 6.82] 7.16 || 198.2 | 72.52 | 72.94 100 
799 609 | .641 | .822 || 18.91 6.8 8.07 || 202.8 | 72.41 | 76.89 47 
800 618 | .596 | .880 || 23.62 | 7.22) 6.59 || 180.58 | 72.71 | 70.15 188 
806 731 | .746 | .938 || 27.24 | 8.62] 8.54 || 207.46 | 79.42 | 79.45 ae 
810 610 | .651 | .806 || 18.28 | 6.80} 6.07 || 197.54 | 78.31 | 74.00 59 
SIT ce wml (ASulee (ov le OOS |e 2OlOOnmcee 7.93 || 200.438 | 73.19 | 73.05 160 
892 R672 2668) | 297822. 7 Ou Shas Sire. il LOS tt i7G-07 tere eG 184 
COL eld less OoonleOOSaeOssOnlmore 6.83 || 189.25 | 73.69 | 73.22 107 
O71 . « .« | 09d) -58L |98T |) 16994 1" F16 \"7200" il 205.8451 74.38) 174.97 58 
984, 5, |) OOO MAGS |) 298451925:.527) “6.74 6.61 4 2138°95 177.255 78000 48 
Average 30 + | .663 | .629 | .924 ||(22.06) 
15-30 5g |) OO one 975 24.25 | 7.27 | 7.48 || 185.06 | 72.90 | 73.36 | 332 

1-15 . .« | .658 | .682°| .870 || 27.00 | 7.92") 7.93 I) T8803 1 74.73.) 7204. 312 
Academies . | .680 | .654 | .840 || 28.45 | 10.11 | 8.96 || 188.0 | 76.08 | 72.36 170 
Total Sp. III | .670 | .654 |      || 25.2 | 8.18 | 8.09 || 192.7 | 74.65 | 73.88 | 2132

Spanish IV

Total Sp. IV | .696 | .718 |      || 20.5 | 8.88 | 9.64 || 222.17 | 80.27 | 79.19 | 226


TABLE 23 


INTERCORRELATIONS BETWEEN NEW-TYPE SCORES, SCHOOL RATINGS AND REVIEW
RATINGS ON OLD-TYPE EXAMINATIONS IN GERMAN FOR INDIVIDUAL SCHOOLS AND
CLASSES OF SCHOOLS, SHOWING VARIATIONS IN ACCURACY OF SCORING IN DIFFERENT
SCHOOLS, AND SHOWING THAT SOME SCHOOLS THAT MOST NEEDED REVIEWING ESCAPED
IT. THE VARIATIONS IN CORRELATIONS ARE PARTLY ACCOUNTED FOR BY DIFFERENCES
IN VARIABILITIES OF INDIVIDUAL SCHOOL GROUPS. NT STANDS FOR NEW-TYPE, S FOR
SCHOOL RATINGS, AND R FOR REVIEW RATINGS.


German II 
                 CORRELATIONS                    SIGMAS                MEANS
SCHOOL NO.   NT and S | NT and R | S and R ||  NT  |  S  |  R  ||  NT  |  S  |  R  |  N
696           | .747 | .747 | 1.000 || 32.18 | 8.48 | 8.48 || 135.37 | 73.85 | 73.85 | 54
793           | .675 | .679 | .989 || 19.05 | 7.39 | 7.40 || 186.82 | 76.52 | 76.55 | 66
830           | .777 | .771 | .985 || 19.3 | 6.32 | 6.45 || 198.33 | 81.89 | 82.0 | 36
862           | .609 | .646 | .994 || 23.8 | 6.38 | 6.18 || 200.36 | 85.71 | 85.71 | 56
884           | .568 | .581 | .986 || 32.0 | 11.66 | 11.26 || 158.23 | 78.82 | 78.21 | 34
892           | .671 | .661 | .982 || 18.75 | 6.82 | 6.66 || 207.21 | 84.71 | 84.79 | 68
896           | .657 | .624 | .986 || 27.45 | 7.9 | 8.19 || 198.88 | 81.24 | 81.46 | 37
901           | .782 | .794 | .998 || 34.8 | See 2 Aes || 147.86 | 77.31 | 77.03 | 70
954           | .83 | .815 | .992 || 24.59 | 6.71 | 6.4 || 182.44 | 83.16 | 83.35 | 43
962           | .469 | .445 | .957 || 28.53 | 6.25 | 6.07 || 192.43 | 81.66 | 82.06 | 35
971           | .488 | .420 | .963 || 18.68 | 5.34 | 5.48 || 194.0 | 82.6 | 84.12 | 50
984           | .781 | .757 | .950 || 19.485 | 5.37 | 4.527 || 219.5 | 85.6 | 86.05 | 40
1039          | .638 | .679 | .976 || 26.72 | 11.2 | 10.66 || 154.48 | 78.85 | 77.80 | 69
Average 30 +  | .669 | .663 | .981 || (24.0)
15-30         | .708 | .689 | .981 || 35.49 | 9.54} O51 180/90 I7777 78.22.) 303
1-15          | .629 | .691 | .910 || 39.55 | 7.45 G29 ||| lOO: |(O.808 | 7 0.Do 324
Academies     | .655 | .753 | .910 || 46.6 | 9.62 | 9.73 || 155.12 | 76.29 | 75.42 | 163
Total Ger. II | .669 | .719 | .951 || 39.09 | 8.84 | 9.02 || 172.1 | 78.498 | 78.42 | 1482

German III
Total Ger. III | .711 | .756 |      || 33.56 | 9.2 | 9.29 || 202.9 | 77.2 | 77.15 | 667

German IV
Total Ger. IV | .640 | .723 |      || 25.4 | 8.52 | 8.38 || 228.07 | 79.65 | 79.35 | 127


TABLE 24


INTERCORRELATIONS BETWEEN NEW-TYPE SCORES, SCHOOL RATINGS AND REVIEW 
RATINGS ON OLD-TYPE EXAMINATION IN PHYSICS FOR INDIVIDUAL SCHOOLS AND CLASSES 
OF SCHOOLS, SHOWING VARIATIONS IN ACCURACY OF SCORING IN DIFFERENT SCHOOLS, 
AND SHOWING THAT SOME SCHOOLS THAT MOST NEEDED REVIEWING ESCAPED IT. THE
VARIATIONS IN CORRELATIONS ARE PARTLY ACCOUNTED FOR BY DIFFERENCES IN VARI- 
ABILITIES OF INDIVIDUAL SCHOOL GROUPS. 


Physics 
                 CORRELATIONS                    SIGMAS                MEANS
SCHOOL NO.   NT and S | NT and R | S and R ||  NT  |  S  |  R  ||  NT  |  S  |  R  |  N
337 . . . | 474] 415 | .849 || 19.45 | 10.27} 9.04 || 92.64 | 80.83 | 80.65| 147 
369 . . . | .818] .290] .493 || 17.65] 7.78} 4.28 || 126.3 | 67.94 | 65.91 64 
696 . . . | .507 | .475 | .933 || 20.65 | 10.31 | 9.5 81.74 | 81.13 | 83.14 | 152 
800) 27 2 12583i| 25617) 2955) 52255 LO 8G MOGs 87.03 | 81.22 | 81.27] 411 
819 . . . | .632 | .650 | .918 |} 21.6 | 11.9 | 10.14 ]| 90.59 | 79.09 | 76.64 81 
804 . . . | 501) .506 | .906 |} 20.45} 9.16} 9.12 || 104.46 | 71.5 | 71.78} 102 
862 . . . | 842) .271 | .840 |} 18.6 | 10.02} 9.34 || 82.69 | 82.54 | 82.19} 129 
863 . . . | 654] .556 | .899 || 20.05 | 10.83 | 8.9 95.45 | 83.18 | 81.41 61 
884 . . . | .686| .631} .930 || 24.1. | 11-76 | 11.5 89.05 | 75.5 | 76.05} 140 
918 . . . | .563 | .594 | .900 |} 21.85] 9.74] 9.88 || 93.42] 75.4 | 74.39] 164 
942 . . . | .517 | .515 | .978 || 20.55 | 11.21 | 11.00 || 87.34 | 81.43 | 81.56 | 123 
971 . . . | .524] .561 | .926 |} 19.81 | 10.89 | 10.08 |} 85.01 | 82.88 | 83.02] 338 
Averages . | .521 | .502 | .877 ||(20.55) 1912 
Total Physics | .556 | .582 | .944 || 22.9 | 10.7 | 10.35 || 97.59 | 77.32 | 76.66 | 14081


School differences with respect to achievement in vocabulary, reading 
and grammar. — In the preceding discussion attention was called to the 
fact that teaching conditions in small and private schools are considerably 
less favorable than in large schools because of the greater heterogeneity 
of the classes in the former. The classification of students for instruc- 
tional purposes is better in the large than in the small schools, but even 
among the large schools there are notable variations in kinds and amounts 
of modern language achievement which have considerable significance both 
for the theory and practice of measurement and for the structure of the 
curriculum in the modern languages. Table 25 summarizes some of these 
variations for second- and third-year French classes in the large schools 
mentioned in connection with Table 21. 

The general conclusion from Table 25 is that even in the largest modern 
language departments instruction is not adapted to individual needs of 
students, and that different schools emphasize different aspects of the 
language, so that achievement is very uneven both for individual stu- 
dents and for individual schools. 

Homogeneity of classes. — Columns (5), (6), and (7) in Table 25 show 
large variations in degree of homogeneity of the classes in different schools 


TABLE 25


INTERCORRELATIONS OF THE VOCABULARY, READING AND GRAMMAR PARTS OF THE 
NEW-TYPE FRENCH EXAMINATION, BASED ON RESULTS FROM INDIVIDUAL SCHOOLS, 
WITH MEANS AND SIGMAS, SHOWING “UNEVENNESS” OF ACHIEVEMENT OF INDIVIDUALS
AND OF SCHOOLS IN THESE THREE ELEMENTS OF MODERN LANGUAGE ACHIEVEMENT. 


French II 
                CORRELATIONS
                BETWEEN PARTS                      SIGMAS                MEANS
SCHOOL NO.   I and II | I and III | II and III ||  I  |  II  |  III  ||  I  |  II  |  III  |  N
   (1)         (2)    |    (3)    |    (4)     || (5) |  (6) |  (7)  || (8) |  (9) |  (10) | (11)

sbyé ~029 | 052°] .492 8.50 | 10.40 | 10.48 || 57.89 | 45.25 | 50.44 102 
369 7772 | .733 | .777 || 18.45 | 15.35 | 13.60 || 38.97 | 22.67 | 18.88 58 
591 .607 | .686 | .637 8.30 | 9.65 | 12.18 || 57.08 | 43.35 | 45.38 59 
791 O01 | .553 | .483 7.85 | 10.90 | 10.35 || 68.57 | 53.57 | 58.71 70 
806 .546 | .610 | .631 || 10.4 9.3 | 10.1 65.51 | 48.48 | 56.69 178 
819 .597 | .488 | .475 8.28 | 9.25 | 10.23 || 59.62 | 42.67 | 45.27 144 
862 .522 | .597 | .491 8.35] 8.73] 8.88 || 66.13 | 50.39 | 59.21 335 
884 .042 | .575 | .574 9.95 | 9.3 | 12.15 || 57.61 | 46.25 | 52.98 143 
919 .633 | .598 | .725 || 10.7 | 11.75 | 11.55 || 58.389 | 42.22 | 51.95 129 
696 .638 | .654 | .617 7.22} 9.8 | 10.75 || 62.72 | 50.45 | 54.78 132 
942 .677 | .597 | .623 8.53 | 10.5 | 10.83 !| 63.79 | 48.38 | 53.14 62 
971 .573 | .667 | .620 UASPAN| Rowrive || ues 70.67 | 52.42 | 59.66 241 
Total Fr. II | .658 | .714 | .699 || 11.05 | 11.85 | 13.5 || 57.61 | 44.64 | 48.47 | 13486

French III 
337 .432 | .623 | .547 7.02} 7.02 | 8.91 || 76.94 | 58.55 | 65.13 61 
696 .647 | .682 | .517 5.382} 8.91] 9.69 || 76.06 | 57.88 | 62.43 68 
701 .575 | .606 | .591 6.26 | 8.51 | 10.53 || 76.31 | 59.5 | 62.31 48 
791 431 | .523 | .509 6.63 | 6.96 | 8.97 || 75.63 | 58.17 | 62.03 45 
806 B1O2 OOo LOOM LL. Ole con La. ove li acom |PODate | Olas 66 
819 .607 | .561 | .529 7.56 | 8.67 | 10.68 || 72.66 | 55.22 | 61.5 43 
862 .457 | .465 | .425 7.14} 7.31) 6.54 || 76.87 | 56.03 | 68.05 235 
881 457 | 612 | 072 9.24 | 10.38 | 10.71 || 68.54 | 52.61 | 58.87 46 
883 .697 | .534 | .583 9.00 | 7.83 | 9.57 || 68.43 | 52.85 | 55.97 43 
884 .063 | .562 | .494 8.43 | 7.44] 9.12 || 74.14 | 55.95 | 64.82 56 
919 .618 | .657 | .549 9.02} 8.81] 9.86 || 71.66 | 52.45 | 59.389 57 
971 .469 | .504 | .846 7.56 | 6.60 | 8.46 || 79.81 | 59.16 | 66.84 115 
984 489 | .642 | .506 8.21 | 8.19 | 10.89 || 73.78 | 56.21 | 68.18 263 
Total Fr. III | .624 | .652 | .609 || 10.20 | 9.65 | 11.55 || 70.37 | 53.75 | 60.26 | 6741

French IV 
Total Fr. IV | .550 | .698 | Oo lL | 10.26 | 8.18 | 11.91 | 79.16 | 59.68 | 67.50 | 489 


with respect to each of the three aspects of achievement measured by the
new-type examinations. The French II class in school 696 is very homo-
geneous with respect to vocabulary, and fairly so with respect to reading
and grammar; by contrast, the French II class of school 806 is more
homogeneous with respect to reading and grammar than 696, but very
much more heterogeneous with respect to vocabulary than 696. There 
are several variations of this sort in Table 25 which the reader may find 
for himself. Space limitations force us to confine ourselves to specific 
mention of only one, and to point out that the teaching problem in school 
696 is very different from that in 806. Half the teaching battle is won 
when we have found just what the student already knows and therefore 
just what he still needs to learn; the battle is as good as lost when we 
“shoot in the dark” by teaching without first ascertaining just what
ought to be taught to a given student or group of students. The new- 
type objective tests will avail much in this matter of defining immediate 
goals and in directing the efforts of teachers under varying classroom 
conditions. It is this capacity of the new-type tests to give constructive 
information that raises them above the level of mere “flunking” and 
punitive devices, and recommends them to wide use not only as final 
examinations but as classroom instruments of indispensable pedagogic 
values. 
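
The comparison of schools 696 and 806 can be read directly from the sigmas in columns (5), (6), and (7) of Table 25. The sketch below restates that comparison under two assumptions: that a smaller sigma is taken as the sign of a more homogeneous class, and that Parts I, II, and III correspond to vocabulary, reading, and grammar. The function names are illustrative only.

```python
PARTS = ("vocabulary", "reading", "grammar")

# Sigmas of Parts I, II, III for French II, columns (5)-(7) of Table 25.
SIGMAS = {
    "696": (7.22, 9.8, 10.75),
    "806": (10.4, 9.3, 10.1),
    "state": (11.05, 11.85, 13.5),
}

def more_homogeneous(school_a, school_b):
    """For each Part, name the school with the smaller sigma."""
    result = {}
    for part, sa, sb in zip(PARTS, SIGMAS[school_a], SIGMAS[school_b]):
        result[part] = school_a if sa < sb else school_b
    return result

def relative_spread(school):
    """Ratio of a school's sigma to the state-wide sigma for each Part."""
    pairs = zip(PARTS, SIGMAS[school], SIGMAS["state"])
    return {part: round(s / t, 2) for part, s, t in pairs}

print(more_homogeneous("696", "806"))
print("696:", relative_spread("696"))
print("806:", relative_spread("806"))
```

On these figures the vocabulary spread of school 696 is roughly two-thirds of the state-wide spread, while that of school 806 is nearly as large as the state-wide spread.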

Unevenness of achievement. — Columns (8), (9), and (10) in Table 25
show that some schools are of high average achievement in one respect, 
e.g. active grammar, and low in vocabulary or reading; and others are 
high in vocabulary but low in reading and grammar. This is undoubtedly 
mainly due to differing emphases on the part of the teachers in different 
schools. Some teachers believe in extensive reading, and tend to neglect 
the other things; other teachers have great faith in active grammar, and 
exercise their students so intensively in it that little time or energy is 
left for vocabulary, reading, oral-aural practice, etc. Thus, without know- 
ing or desiring it, teachers sacrifice fully-rounded progress of their pupils 
to their personal predilections and individual theories of the proper study 
of languages. The use of objective and standardized tests which repre- 
sent each of the various aspects of modern language achievement will 
save both teachers and pupils from such errors. It has obviously 
been impossible to secure such constructive guidance from the old-type 
subjective examinations. Their subjectivity alone is sufficient to pre- 
vent them from performing such service. There are some teachers who 
believe they can tell from a student’s translations of two or three brief 
passages how much grammar, vocabulary, reading ability, etc., he has; 
but this is obviously unwarranted faith. 
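
One way to make such unevenness visible is to express each school's Part means as deviations from the state-wide means, in units of the state-wide sigmas. This is a sketch only, not a procedure described in the report; the figures are the French II means for schools 591 and 884 and the "Total Fr. II" row of Table 25, and the assignment of Parts I, II, III to vocabulary, reading, and grammar is assumed.

```python
PARTS = ("vocabulary", "reading", "grammar")

# "Total Fr. II" row of Table 25: state-wide means and sigmas of each Part.
STATE_MEANS = (57.61, 44.64, 48.47)
STATE_SIGMAS = (11.05, 11.85, 13.5)

# French II means of two schools, columns (8)-(10) of Table 25.
SCHOOL_MEANS = {
    "591": (57.08, 43.35, 45.38),
    "884": (57.61, 46.25, 52.98),
}

def profile(school):
    """Deviation of each Part mean from the state mean, in state sigma units."""
    rows = zip(PARTS, SCHOOL_MEANS[school], STATE_MEANS, STATE_SIGMAS)
    return {part: (mean - state_mean) / state_sigma
            for part, mean, state_mean, state_sigma in rows}

for school in SCHOOL_MEANS:
    rounded = {part: round(value, 2) for part, value in profile(school).items()}
    print(school, rounded)
```

School 884 stands exactly at the state-wide mean in vocabulary but about a third of a sigma above it in grammar, whereas school 591 is close to the mean in vocabulary and falls furthest below it in grammar.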

Intercorrelations of new-type Parts for individual schools. — The inter- 
correlations of the Parts shown in columns (2), (3), and (4) are fairly 
constant for all the schools represented in Table 25. This fact confirms 
the indications of earlier tables that the vocabulary, reading, and grammar 
aspects are definitely interdependent, yet specific and independent enough 
to require separate measurements. There appears to be no departure 
from this rule. In no school would a test of vocabulary alone, reading
alone, or grammar alone, give an adequate measure of total modern 
language achievement. From this we may infer that an examination 
which consists mainly of written translation exercises is not broad enough 
to afford adequate measures of total language achievement, for if vocabu-
lary and silent reading are independent enough to require separate measure-
ment, surely so distinct a skill and so questionable a discipline as turning
one language into another cannot be considered as representative of total 


TABLE 26 


INTERCORRELATIONS OF THE VOCABULARY, READING AND GRAMMAR PARTS OF THE 
NEW-TYPE SPANISH EXAMINATION, BASED ON RESULTS FROM INDIVIDUAL SCHOOLS, 
WITH MEANS AND SIGMAS, SHOWING “UNEVENNESS” OF ACHIEVEMENT OF INDIVIDUALS
AND OF SCHOOLS IN THESE THREE ELEMENTS OF MODERN LANGUAGE ACHIEVEMENT. 


Spanish II 
                CORRELATIONS
                BETWEEN PARTS                      SIGMAS                MEANS
SCHOOL NO.   I and II | I and III | II and III ||  I  |  II  |  III  ||  I  |  II  |  III  |  N
   (1)         (2)    |    (3)    |    (4)     || (5) |  (6) |  (7)  || (8) |  (9) |  (10) | (11)
Se ee ee OD I eCOSi1 054.1) 22890) ORG ili ou Od los ln4ge Gall aaieo 109 
466) ee) eO28eeco9 | 62211 Or. 9.6 | 10.95 || 72.25 | 51.7 | 38.9 85 
S04 ee eee AOS Ooi | 44) Lda 10.55 | 10.86 || 56.6 | 40.55 | 34.1 40 
S06). ee eply 1684) (515 10/63 |" 8:79") 1383) 74.574) 50:38 1.50.59 213 
SPOR oe pete |e (O02 FH eoo0| POOO euey i APA a4 65.7 | 51.03 | 40.8 38 
SG Qe ee DO Le cOol IO 8.67 | 8.61 | 15.2 73.15 | 51.67 | 46.4 257 
Ole oe een | eool | 49% 4255 10:7 8.85 | 6.81 || 57.2 | 44.75 | 25.1 47 
C1 Om ee eee OLO fa. G5: |OoS)| Osea at TeLGn 1228 ORS) | ABI. PES RIL 89 
Total Sp. II | .629 | .702 | .578 || 11.48 | 9.69 | 14.16 || 70.11 | 51.68 | 43.12 | 5340
Spanish III 
696 616 | .631 | .596 8:00 | 6.65 | 12.338 || 75.4 | 54.7 | 40.3 30 
793 417 | .669 | .521 6.88 | 6.89 | 11.77 || 83.36 | 59.5 | 56.06 100 
799 397 | .771 | .3881 G:SS imo.) elude? 83.61 | 59.36 | 57.96 47 
800 Sia | 0a | OLS 9.57 | 7.97 | 10.35 || 74.21 | 54.48 | 49.86 188 
806 669 | .595 | .565 9.29 | 8.15 | 12.7 84.39 | 57.42 | 62.98 U0 
810 202 | .340 | .607 7.87 | 6.48 | 9.14 |} 81.15 | 58.99 | 55.01 59 
811 421 | .660 | .425 6.6 G'Gaaleniee 83.63 | 54.91 | 56.46 160 
892 410 | .630 | .490 CSL CON Tet NiSh.O2 [ROO 2i 7080 184 
9001 . . . | 421 | .584| .411 7.25 | 7.08 } 10.55 || 81.03 | 57.07 | 51.45 107 
671. . . | 1361 | 386 |.458 4.81| 6.00] 9.96 || 83.14 | 60.5 | 63.03 58 
OS re O24 739 O22 6.55 | 7.02 | 11.67 |} 86.19 | 60.83 | 66.02 48 
Total Sp. III | .522 | .605 | .486 || 8.79 | 7.83 | 12.78 || 80.9 | 57.66 | 53.95 | 2132
Spanish IV 
Total Sp. IV | .378 | .573 | .417 || 6.57 | 6.81 | 12.15 || 89.69 | 63.82 | 69.77 | 226




modern language achievement in the high school stages. In a final exami- 
nation which aims to measure total achievement and to give the teacher 
information that may be used constructively in the classroom, tests of 
vocabulary and silent reading ability are just as necessary as tests of 
grammar, translation, and oral and aural skills. 
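
The claim that the Parts are "interdependent, yet specific and independent enough to require separate measurements" may be illustrated by squaring the intercorrelations, which gives the proportion of variance two Parts share. This reading in terms of shared variance is supplied here as an aid and is not used in the report itself. The coefficients are those of the "Total Fr. II" row of Table 25; pairing columns (2), (3), and (4) with Parts I and II, I and III, and II and III is an assumption.

```python
# State-wide French II intercorrelations, "Total Fr. II" row of Table 25.
# The pairing of columns with Parts is assumed, not confirmed by the table.
STATE_INTERCORRELATIONS = {
    ("vocabulary", "reading"): 0.658,
    ("vocabulary", "grammar"): 0.714,
    ("reading", "grammar"): 0.699,
}

# Squared correlation: proportion of variance that the two Parts have in common.
shared = {pair: r ** 2 for pair, r in STATE_INTERCORRELATIONS.items()}

for (first, second), proportion in shared.items():
    print(f"{first} and {second}: {proportion:.1%} shared, "
          f"{1 - proportion:.1%} not shared")
```

On these figures roughly half of the variation in any one Part is not accounted for by another, so that no single Part can stand for the whole.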


TABLE 27 


INTERCORRELATIONS OF THE VOCABULARY, READING AND GRAMMAR PARTS OF THE
NEW-TYPE GERMAN EXAMINATION, BASED ON RESULTS FROM INDIVIDUAL SCHOOLS,
WITH MEANS AND SIGMAS, SHOWING “UNEVENNESS” OF ACHIEVEMENT OF INDIVIDUALS
AND OF SCHOOLS IN THESE THREE ELEMENTS OF MODERN LANGUAGE ACHIEVEMENT. 


German II 
                CORRELATIONS
                BETWEEN PARTS                      SIGMAS                MEANS
SCHOOL NO.   I and II | I and III | II and III ||  I  |  II  |  III  ||  I  |  II  |  III  |  N
   (1)         (2)    |    (3)    |    (4)     || (5) |  (6) |  (7)  || (8) |  (9) |  (10) | (11)
696 og | CCSD eS Ih ee Ih Tile ass 9.975 | 12.9 65.89 | 41.89 | 30.45 54 
793 eA 9u oGZaleo4+5 7.08 LOOSE LEZ bn Szeknosceye tae 66 
830 , » | Or | SOS | 22% 7.16 6.95 | 10.13 || 84.0 54.67 | 60.58 36 
862 , .. | .675 | 485 | .409 Gow Teal) \ 12°39 |WSEZE. | 55282 | 5S-64 56 
884 . 5 | aXe | sete |] ater! fh a8 10.66 | 13.73 || 73.06 | 46.99 | 37.59 34 
892 » on || EN BG || Ue 6.52 6.21 11.92 |} 88.5 57.13 | 65.69 68 
896 . . | .086 | .649 | .619 9.48 8.05 | 13.8 SO.205 1 o.08 P pa<58 37 
901 og OA OTL AL AK) II Ee 11.8 14.88 || 70.2 45.0 | 36.0 70 
954 GOLA S25-330 9.59 9.69 | 10.9 Stb2. | 52.997) SLSS. 43 
962 so | Or |) Geo) 4ale 8.37 9.75 8.89 || 81.13 | 54.13 | 61.41 35 
971 yg aOR) Seve sever MOO 6.47 10.3 85.44 | 55.08 | 57.9 50 
984 . . | .028 |.429 || 483 6.596 | 7.999 | 9.86 || 89.625 | 57.15 | 73.5 40 
1039 . . | .618 | .499 | .509 || 10.0 10.3 12.54 || 68.48 | 46.61 | 38.67 69 
Total II 183) 685 |.533" | 145 8.21 iit 76.44 | 48.51 | 46.86 | 1482 
| 
German III 
Total III . | .7AL | .730 | .632 9.77 9.45 | 9.46 | 88:28 | 57:2 | 57.11 667 
German IV 
TotalIV . | 443 | A87 | 468 6.64 7.03 | 15.95 | 94.39 | 63.54 | 68.95 | 127 


Tables 26 and 27, relating to Spanish and German results, are similar 
to Table 25 and confirm its indications. Indeed, the variations in degree 
of heterogeneity of classes with respect to vocabulary, reading, and gram- 
mar are greater in Spanish and in German than in French classes. For 
example, the Spanish II class in school 917 is only half as variable with
respect to grammar as most of the other schools; this is also true of the 
Spanish III class in school 892, and to a lesser degree of the German II
class in school 962. Several of the Spanish III classes represented in 
Table 26 are about twice as heterogeneous with respect to vocabulary as 
the Spanish III class in school 971. The increase in homogeneity from 
second- to fourth-year classes is much greater in Spanish and German 
than in French, although the numbers of students in the fourth-year 
classes in Spanish and German are too small to afford very reliable indi-
cations. The general conclusion from Tables 25-27 is that the new-type
tests have uncovered variations in teaching conditions and in teaching 
problems in different schools which the old-type tests could not reveal; 
that, in addition to giving more reliable and more complete measures of 
total language achievement than the old-type, the new-type tests give 
more specific information which can be used in a constructive manner 
in the classroom in individualizing instruction and in directing the efforts 
of both teachers and students into the most profitable channels. The 
new-type tests will thus perform as great a service when given at the 
beginning and during the school year as at the end. 
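
The greater increase in homogeneity in Spanish and German can be illustrated with the sigmas of total scores for the second-, third-, and fourth-year classes. The sketch below uses the total-score sigmas reported in Tables 28-30 rather than the Part sigmas of Tables 25-27, and the measure chosen, the proportional reduction in sigma from second to fourth year, is an assumption made for illustration.

```python
# Sigmas of new-type total scores for second-, third-, and fourth-year classes,
# as reported at the foot of Tables 28 (French), 29 (Spanish), and 30 (German).
TOTAL_SCORE_SIGMAS = {
    "French": (31.6, 27.4, 27.4),
    "Spanish": (30.9, 25.2, 20.5),
    "German": (39.1, 33.6, 25.4),
}

def sigma_reduction(language):
    """Proportional drop in sigma from the second- to the fourth-year class."""
    second, _third, fourth = TOTAL_SCORE_SIGMAS[language]
    return (second - fourth) / second

for language in TOTAL_SCORE_SIGMAS:
    print(f"{language}: sigma falls by {sigma_reduction(language):.0%}")
```

The reduction in Spanish and in German is about a third, against roughly an eighth in French, subject to the caution already given about the small fourth-year classes.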


IV 


OVERLAPPING OF CLASSES AND VARIABLE SCHOOL STAND- 
ARDS NOT CORRECTED BY REGENTS EXAMINATIONS 


DISTRIBUTIONS OF NEW-TYPE EXAMINATION SCORES BY YEAR-CLASSES


For the convenience of investigators and others who may use the new- 
type Regents examinations, or one of their equivalent forms, we shall 
now present without comment a series of eight tables showing various 
distributions of scores on the new-type examinations. 

Distributions of new-type total scores. — Tables 28-31, inclusive, show 
distributions of total scores on the new-type examinations in French, 
Spanish, German, and physics, (a) of all students in each class taking 
Regents examinations in June, 1925, (b) of all students in each class 
passed by the old-type part of the Regents examinations, and (c) of all 
students failed by the old-type part of the Regents examinations. 

Some of the facts about the modern language situation in New York 
State which these tables reveal will be elaborated and set forth graphically 
in later sections of this report; but the tables themselves are well worth 
whatever time the reader may spend in studying them. 
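
Readers who wish to study the tables may find it useful to recover the approximate quartile and median scores from the grouped frequencies. The following is a minimal sketch, assuming linear interpolation within each ten-point interval and taking each interval to begin at its printed lower limit; the report does not state how its quartiles were obtained. The frequencies are those of the French II total population in Table 28.

```python
def grouped_quantile(lower_bounds, counts, width, q):
    """Interpolated score below which a proportion q of the cases fall."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    target = q * sum(counts)
    cumulative = 0
    for lower, count in zip(lower_bounds, counts):
        if count > 0 and cumulative + count >= target:
            return lower + (target - cumulative) / count * width
        cumulative += count
    raise ValueError("no cases in the distribution")

# French II, total population, Table 28: intervals 10-19, 20-29, ..., 250-259.
lower_bounds = list(range(10, 260, 10))
counts = [1, 4, 10, 19, 31, 76, 111, 230, 352, 555, 788, 1190, 1462,
          1577, 1691, 1613, 1291, 1021, 735, 424, 197, 79, 23, 6, 0]

lower_quartile = grouped_quantile(lower_bounds, counts, 10, 0.25)
median = grouped_quantile(lower_bounds, counts, 10, 0.50)
upper_quartile = grouped_quantile(lower_bounds, counts, 10, 0.75)

print(round(lower_quartile), round(median), round(upper_quartile))
```

Under these assumptions the three values round to the quartile and median scores printed for the French II total population at the foot of Table 28.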

No attempt can be made within the space limits at our disposal to notice 
all the interesting features of these tables. Attention, however, may be 
called to one series of comparisons which the reader may make. In Table 28 
the sixth line from the bottom shows the approximate per cents of students 
passed and failed in each year-class by the old-type part of the Regents 
examinations. Why is the proportion of failures in French IV greater than 
in French III? The proportions of students failed by the old-type Regents 
examinations in second-, third-, and fourth-year French are, respectively, 
24%, 15% and 20%. A mortality rate of one in five for fourth-year stu-
dents, many of whom have attended “five recitations a week” for five or
more years, is quite serious. One would expect that after students have 
been allowed to stay in modern language work for three or more years the 
proportions of failures would be progressively smaller. Table 29 fulfils this 
expectation by showing that the proportions of failures in Spanish steadily 
decrease from second- to fourth-year classes—20%, 15%, and 7%. In ful- 
filling this expectation Spanish stands alone, for in Table 30 we find the 
trend of failing rates exactly reversed, the proportions of failures for second-,
third-, and fourth-year German classes being 11%, 13%, and 13%, re-
spectively. These figures constitute a serious indictment against the educa-
tional guidance which is accorded to modern language students in New
York State. 
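
The proportions of failures quoted above follow from the "Total" lines of the tables. The sketch below recomputes them for French and Spanish from the total and failed counts in Tables 28 and 29; German is omitted here, and rounding to the nearest whole per cent is an assumption about how the approximate per cents were obtained.

```python
# (total population, failed by old-type examination) for the second-,
# third-, and fourth-year classes, from the "Total" lines of Tables 28 and 29.
COUNTS = {
    "French": [(13486, 3195), (6741, 1024), (489, 99)],
    "Spanish": [(5340, 1058), (2132, 313), (226, 16)],
}

def failure_percents(language):
    """Per cent of each year-class failed, rounded to the nearest whole number."""
    return [round(100 * failed / total) for total, failed in COUNTS[language]]

for language in COUNTS:
    print(language, failure_percents(language))
```

The French rate rises again in the fourth year, while the Spanish rate falls steadily, as the text states.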


TABLE 28


DISTRIBUTIONS OF TOTAL SCORES ON NEW-TYPE REGENTS EXAMINATION IN FRENCH,
JUNE, 1925, OF ALL STUDENTS, OF “PASSED” AND OF “FAILED” STUDENTS BY YEAR-
CLASSES, SHOWING MEANS AND SIGMAS, AND APPROXIMATE QUARTILE AND MEDIAN
SCORES. THE FIGURES IN PARENTHESES SHOW APPROXIMATE PER CENTS OF EACH
CLASS PASSED AND FAILED BY THE OLD-TYPE REGENTS EXAMINATION.


French 
NEW-TYPE |           FRENCH II            |           FRENCH III           |           FRENCH IV
EXAM.    |  Total   | Passed   | Failed   |  Total   | Passed   | Failed   |  Total   | Passed   | Failed
TOTAL    | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type
SCORES   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.
10-19 1 1 
20-29 4 4 1 1 
30 10 10 
40 19 19 3 3 
50 31 31 6 6 
60 76 3 73 t 7 
70 111 6 105 5 5 
80 230 3l 199 8 8 
90 352 (2 280 20 20 1 1 
100 555 178 377 29 4 29 if 1 
110 788 326 462 51 9 42 1 il 
120 1190 669 521 100 Pf 73 1 1 
130 1462 1017 445 183 15 108 2 2 
140 1577 1275 302 295 148 147 7 1 6 
150 1691 1492 199 450 293 157 14 5 9 
160 1613 1514 99 752 590 162 31 18 13 
170 1291 1249 42 921 789 132 38 22 16 
180 1021 1003 18 1069 988 81 37 23 14 
190 735 728 7 1000 970 30 54 39 15 
200 424 424 826 815 iit 75 69 6 
210 197 197 533 530 3 (2 61 11 
220 79 78 J 301 300 1 64 60 4 
230 23 23 139 137 2 49 48 1 
240 6 6 35 35 36 36 
250-259 7 7 6 6 
Total 13486 | 10291 3195 6741 5717 1024 489 390 99
Per cents (76%) | (24%) (85%) | (15%) (80%) | (20%) 
Means 150.4 183.4 205.8 
Sigmas 31.6 27.4 27.4 
L. Q. 130 102 102 167 173 136 187 196 165
Mdn. 152 160 120 185 189 154 208 213 182 
U. Q. 173 179 137 203 214 170 225 229 197


TABLE 29


DISTRIBUTIONS OF TOTAL SCORES ON NEW-TYPE REGENTS EXAMINATION IN SPANISH, 
JUNE, 1925, OF ALL STUDENTS, OF “PASSED” AND OF “FAILED” STUDENTS BY YEAR-
CLASSES, SHOWING MEANS AND SIGMAS, AND APPROXIMATE QUARTILE AND MEDIAN 
SCORES. THE FIGURES IN PARENTHESES SHOW APPROXIMATE PER CENTS OF EACH 
CLASS PASSED AND FAILED BY THE OLD-TYPE REGENTS EXAMINATION. 


Spanish 
NEW-TYPE |           SPANISH II           |           SPANISH III          |           SPANISH IV
EXAM.    |  Total   | Passed   | Failed   |  Total   | Passed   | Failed   |  Total   | Passed   | Failed
TOTAL    | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type
SCORES   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.
40-49 1 1 
50-59 5 5 
60 9 9 
70 18 1 17 1 1 
80 25 2 23 
90 55 4 dl 1 1 
100 122 19 103 4 4 
110 181 56 125 9 2 1 
120 319 137 182 8 1 7 
130 401 241 160 18 8 10 
140 609 451 158 45 14 31 
150 649 541 108 111 62 49 
160 688 622 66 177 116 61 2 1 1 
170 593 563 30 248 195 53 5 4 1 
180 584 576 8 329 285 44 10 8 2 
190 453 445 8 364 336 28 AW 13 4 
200 286 284 2 309 299 10 26 24 2 
210 195 194 1 223 218 5 37 34 3 
220 93 91 2 130 130 38 36 2 
230 38 38 106 105 1 43 42 1 
240 15 15 37 36 1 34 34 
250 1 1 10 10 12 12 
260-269 2 2 2 2 
Total 5340 4282 1058 2132 1819 313 226 210 16 
Per cents (80%) | (20%) (85%) | (15%) (93%) | (7%) 
Means 164.4 192.7 222.2
Sigmas 30.9 25.2 20.5 
L. Q. 143 152 115 176 182 153 209 210 
Mdn. 163 fall 131 193 197 166 224 226 oe 
U. Q. 185 190 148 208 212 183 238 238 216


TABLE 30


DISTRIBUTIONS OF TOTAL SCORES ON NEW-TYPE REGENTS EXAMINATION IN GERMAN,
JUNE, 1925, OF ALL STUDENTS, OF “PASSED” AND OF “FAILED” STUDENTS BY YEAR-
CLASSES, SHOWING MEANS AND SIGMAS, AND APPROXIMATE QUARTILE AND MEDIAN
SCORES. THE FIGURES IN PARENTHESES SHOW APPROXIMATE PER CENTS OF EACH
CLASS PASSED AND FAILED BY THE OLD-TYPE REGENTS EXAMINATION.


German 
NEW-TYPE |           GERMAN II            |           GERMAN III           |           GERMAN IV
EXAM.    |  Total   | Passed   | Failed   |  Total   | Passed   | Failed   |  Total   | Passed   | Failed
TOTAL    | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type | Popula-  | Old-Type | Old-Type
SCORES   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.   |  tion    |  Exam.   |  Exam.
50-59 4 1 33 
60-69 6 6 
id 10 2 8 
80 18 8 10 
90 36 19 AW? 
100 42 22 20 6 2 4 
110 47 31 16 ai 4 3) 
120 77 63 14 6 3 3 
130 69 57 12 it 1 6 
140 91 81 10 13 3 10 Dy 2 
150 127 114 13 37 21 16 3 1 2 
160 137 128 9 33 19 14 1 il 
170 126 123 3 54 45 9 4 1 3 
180 154 151 3 55 51 4 6 2 4 
190 151 149 2 79 67 12 11 10 1 
200 141 141 74 au 3 10 8 2 
210 93 93 72 70 2 28 27 1 
220 75 75 62 62 14 13 1 
230 47 47 67 67 24 24 
240 23 23 52 52 15 15 
250 6 6 35 35 8 8 
260 2 2 7A 7 1 1 
270-279 1 1 
Total 1482 1336 146 667 581 86 127 110 17
Per cents (89%) | (11%) (87%) | (13%) (87%) | (13%)
Means 172.1 202.9 228.1
Sigmas 39.1 33.6 25.4 
L. Q. 147 154 96 180 189 145 205 212 160
Mdn. 176 181 116 215 210 160 219 225 180 
U. Q. 201 203 141 230 a2 180 237 238 195


TABLE 31


DISTRIBUTIONS OF TOTAL SCORES ON NEW-TYPE REGENTS EXAMINATION IN PHYSICS,
JUNE, 1925, OF ALL STUDENTS, OF “PASSED” AND OF “FAILED” STUDENTS BY YEAR-
CLASSES, SHOWING MEANS AND SIGMAS, AND APPROXIMATE PER CENTS OF EACH
CLASS PASSED AND FAILED BY THE OLD-TYPE REGENTS EXAMINATION.


Physics 
NEW-TYPE EXAM. |   TOTAL    |  PASSED  |  FAILED
TOTAL SCORES   | POPULATION | OLD-TYPE | OLD-TYPE
               |            |  EXAM.   |  EXAM.
125-129 3 3 
120-124 i a 
115-119 22 22 
110-114 43 43 
105-109 53 53 
100-104 110 109 1 
95-99 142 141 1
90-94 241 239 2 
85-89 278 276 2 
80-84 431 426 5 
75-79 465 456 9 
70-74 654 641 13 
65-69 713 689 24 
60-64 944 895 49 
55-59 968 898 70 
50-54 1136 1043 93 
45-49 1222 1104 118 
40-44 1240 1036 204 
35-39 1103 887 216 
30-34 1085 826 259 
25-29 836 577 259 
20-24 818 519 299 
15-19 563 348 215 
10-14 460 252 208 
5-9 274 122 152 
0-4 270 105 165 
Numbers. 14,081 Thy aay 2,364 
Percent. . 85% 15% 
Means . . 97.12 93.29 115.98 
Sigmas .. 23.0 22.15 16.77 


Distributions of scores on each Part, and on Parts I and II combined, 
of new-type tests. — Tables 32-34, inclusive, show distributions of scores 
on each Part of the new-type modern language tests by total year-classes. 
Table 35 shows distributions of combined Part I and Part II scores, on 
the new-type modern language examinations for random samplings of 
each year-class. These last four tables are included because of their 
general usefulness to investigators, and for the convenience of teachers 
who may administer only one or two Parts of the new-type tests, or who 
wish to compare the achievements of their own classes in each Part with 
state-wide norms for each Part. 


TABLE 32 


DISTRIBUTIONS OF SCORES ON EACH PART OF THE NEW-TYPE REGENTS EXAMINATION 
OF JUNE, 1925, BY YEAR-CLASSES 


French 
              Part I              Part II             Part III 
            Vocabulary          True-False          Completion 
Classes    II   III   IV       II   III   IV       II   III   IV 
Scores 
-1-0-1 31 6 2 1 
2-3-4 25 2 6 3 
5-6-7 43 29 Pe: 
9 1 1 54 1 40 6 
12 5 1 80 9 52 i 
15 4 121 6 1 107 4 
18 4 2 144 15 136 5 
21 16 il 214 13 195 12 
24 26 1 381 25 1 306 19 
27 35 és 433 38 1 362 16 4 
30 86 3 675 52 564 37 1 
33 132 7 793 92 2 618 49 2 
36 241 8 1071 166 2 788 ae 1 
39 385 14 1051 186 5 974 116 3 
42 458 29 1054 369 12 1079 204 4 
45 856 44 1 1497 417 9 1141 244 6 
48 981 92 1 1895 720 Pai! 1139 377 11 
51 1291 122 4 1197 653 Del 1169 418 25 
54 1383 211 7 1097 1005 60 1057 579 20 
57 1452 304 7 1056 801 50 1018 703 35 
60 1390 455 11 441 909 82 750 733 36 
63 1282 576 18 370 594 71 678 764 33 
66 1085 724 22 173 436 87 466 659 50 
69 795 802 33 78 155 36 326 593 43 
72 680 780 41 27 50 22 279 444 50 
75 358 736 39 5 21 91 302 45 
78 281 632 50 76 172 35 
81 116 530 70 30 115 38 
84 73 325 43 8 53 23 
87 48 201 47 5 26 15 
90 12 93 48 5 1 
93 6 36 37 21 1 
95-96-97 3 5 10 
98-99-100 1 
Numbers | 13486 6741 489 | 13486 6741 489 | 13486 6741 489 
Means 57.61 | 70.37 | 79.16 | 44.64 | 53.75 | 59.68 | 48.47 | 60.26 | 67.50 
Sigmas 11.05 | 10.20 | 10.26 | 11.85 9.65 8.18 | 13.5 11.55 | 11.91 


TABLE 33 


DISTRIBUTIONS OF SCORES ON EACH PART OF THE NEW-TYPE REGENTS EXAMINATION 
OF JUNE, 1925, BY YEAR-CLASSES 


Spanish 
              Part I              Part II             Part III 
Classes    II   III   IV       II   III   IV       II   III   IV 
Scores 
2-3-4 2 3 
5-6-7 1 8 
9 2 18 1 
12 4 41 1 
15 1 6 82 6 
18 8 1S 3 
21 2 16 2 140 vs 
24 Z Di Zz 223 20 
27 5 35 282 14 
30 6 57 3 323 33 
33 8 129 10 426 47 
36 13 167 14 445 77 1 
39 13 FE 24 398 98 2 
42 33 1 395 60 458 156 2 
45 63 3 472 79 5 596 158 4 
48 123 4 580 THe 5 388 189 5 
51 164 i 592 180 4 350 233 9 
54 194 7 697 289 13 262 192 10 
57 281 10 589 318 23 246 1 We 16 
60 349 28 591 337 30 243 181 #2 
63 440 44 1 293 255 at 156 130 13 
66 466 66 258 253 51 133 118 18 
69 492 90 3 99 108 29 94 79 25 
72 587 150 1 41 38 20 47 res. 16 
75 542 238 6 2 3 9 34 63 21 
78 495 249 6 27 28 22 
81 394 315 16 16 22 20 
84 297 271 18 5 19 10 
87 186 278 35 3 8 11 
90 97 184 48 2 8 
93 52 123 53 i 
95-96-97 31 49 22 
98-99-100 4 15 17 
Numbers 5340 2132 226 5340 2132 226 5340 2132 226 
Means OnE 80.9 89.69 | 51.68 57.66 63.82 | 43.12 53.05 60.77 
Sigmas 11.438 8.79 6.57 | 9.69 7.83 6.81 | 14.16 12.78 12.15 


TABLE 34 


DISTRIBUTIONS OF SCORES ON EACH PART OF THE NEW-TYPE REGENTS EXAMINATION 
OF JUNE, 1925, BY YEAR-CLASSES 


German 
              Part I              Part II             Part III 
Classes    II   III   IV       II   III   IV       II   III   IV 
Scores 
—1-0-1 il 
2-3-4 1 2 
5-6-7 1 Uh 
9 5 9 1 
12 6 21 4 
15 9 32 
18 9 1 34 55 
21 1 12 43 i) 1 
24 31 1 59 7 
27 1 3l 3 54 Ta 1 
30 2 46 3 68 19 1 
33 ] 50 4 72 20 1 
36 2 67 14 iL 74 29 
39 8 76 11 1 90 22 2 
42 10 1 107 20 if 90 34 4 
45 15 118 33 2 92 33 4 
48 29 140 49 86 34 2 
51 43 1 145 53 3 81 47 6 
54 38 3 154 59 8 90 39 2, 
57 52 8 142 > if 91 59 3 
60 51 2 137 89 15 70 39 6 
63 65 6 il 95 92 12 66 25 15 
66 77 8 1 61 87 46 64 29 7 
69 88 6 25 ol 17 49 30 8 
72 85 19 ii 23 12 38 41 8 
75 115 29 2 z 2 2 32 35 13 
78 102 33 3 30 23 9 
81 132 50 3 20 14 5 
84 128 61 1 7 22 6 
87 132 63 9 6 21 9 
90 114 80 6 2 10 5 
93 97 105 20 3 7 6 
95-96-97 69 116 39 2 3 
98-99-100 25 76 42 
Numbers 1482 667 127 1482 667 127 1482 667 | 127 
Means 76.44 | 88.28 | 94.39 48.51 | 57.2 63.54 46.86 | 57.11 | 68.95 
Sigmas 14.1 9.77 6.64 8.21 9.45 7.03 iad 17.74 | 15.95 


TABLE 35 


DISTRIBUTIONS OF COMBINED SCORES ON THE VOCABULARY AND READING PARTS 
OF THE NEW-TYPE EXAMINATIONS IN FRENCH, SPANISH, AND GERMAN OF RANDOM 
SAMPLINGS OF STUDENTS WHO TOOK THE REGENTS EXAMINATIONS IN JUNE, 1925. 


             French              Spanish             German 
Scores    II   III   IV       II   III   IV       II   III   IV 
15-19 1 
20-24 3 
25 1 
30 6 2 
35 6 1 
40 9 1 1 
45 6 1 3 
50 17 3 5 
55 10 6 4 
60 23 7 7 
65 26 8 1 14 
70 43 1 1 10 1 LZ 
75 55 +f 20 20 2 
80 1038 5 i 28 1 34 1 
85 123 11 1 41 Wy 45 4 
90 153 35 1 64 12 35 2 
95 169 49 5 88 14 62 5 1 
100 IA 77 5 122 10 54 8 
105 194 ie 18 139 27 60 5 
110 188 183 13 162 65 79 10 
115 170 228 25 193 84 91 16 
120 146 247 30 235 137 5 96 24 2 
125 137 248 33 211 182 6 95 45 if: 
130 109 226 39 213 236 2 94 46 1 
135 68 208 47 162 283 9 123 45 2 
140 39 144 53 140 290 25 112 46 2 
145 18 89 53 97 282 28 125 60 11 
150 13 60 48 52 225 41 76 68 3 
155 5 22 44 13 48 52 57 76 17 
160 I 6 24 8 The 34 47 74 21 
165 9 1 29 14 26 59 17 
170-174 1 2 10 7 16 5 
Numbers 2013 1960 451 2028 2008 226 1389 612 83 
Means 105.78 | 125.14 | 138.32 || 120.65 | 137.55 | 153.14 || 125.15 | 145.42 | 157.43 
Sigmas 21.67 | 14.91 17.01 19.22 | 14.32 10.74 24.95 | 18.25 | 12.69 


OVERLAPPING OF CLASSES 


Means and sigmas of class distributions presented graphically. — Chart 
16 is based on the data of Tables 28-31 and summarizes graphically the 
class-averages and variabilities of all French, Spanish, and German stu- 
dents who took the Regents examinations in June, 1925. The new-type 
tests are well adapted to the whole range of achievement in high school 
modern language work, and display the class-differences very well. The 
differences between year-class averages are 33 and 23 score-points for 
French, 29 and 29 for Spanish, and 31 and 25 score-points for German 
classes. The sigma ranges show extensive overlapping of classes in all 
three languages. 

Chart 16. — Means and sigma-ranges of scores on new-type modern language 
examinations in French, Spanish, and German, of all students taking the Regents 
examinations in June, 1925. The sigma-range comprises about two-thirds of the scores 
in each class; the overlapping of classes is very apparent. 


Fallacy of customary assumption underlying use of separate examinations 
for year-classes. — The customary assumption regarding modern language 
students who are in separate year-classes is that they are separate and 
distinct species, so that students in French III, for example, are regarded 
as ipso facto entirely above French II students and entirely below those 
in French IV. This assumption is illustrated graphically in the following 
figure. 


Low I    High I | Low II    High II | Low III    High III | Low IV    High IV 

0               25                  50                    75                 100 


The horizontal line represents the total range of achievement in high school 
modern language work, from the beginning of French I to the end of 
French IV. The lowest French II students are still above the highest 
French I students, and the highest French II students are still below the 
lowest French III students. The examining problem is, therefore, one 
of merely differentiating among the best and poorest of each of the dis- 
crete groups; since, in accordance with the assumption, failure in one 
course can not mean anything worse than having to repeat the course, 
and similarly passing can not mean more than that the student is more 
or less well prepared to begin the next course. This assumption sounds 
quite absurd when stated thus baldly, yet it is the only assumption that 
in any way justifies the giving of separate examinations to each year-group. 
If the assumption were correct, then separate examinations, having the 
same relations on the scale of difficulty as the different classes have on the 
scale of achievement, might, presumably, be given with satisfactory results. 

It is clear from Chart 16, however, that whatever examinations or ad- 
ministrative procedures are responsible for assigning the students to the 
year-classes in which they took the Regents examinations in June, 1925, 
the classes were not formed in accordance with the assumption illustrated 
in the figure above. The classification is so poor that it is difficult to 
avoid the conclusion that teaching efficiency must have been seriously 
impaired, and constructive educational guidance of students rendered 
practically impossible. Since the sigma-range shows the range of scores 
of the middle two-thirds of each class, it is easy to see that the upper 
15% of French III students are considerably above the French IV aver- 
age, and that the lower 15% of the French IV students are well below 
the French III average. And so with the German and Spanish classes. 
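
The "two-thirds" and "15%" figures follow if each class distribution is roughly normal. The report does not state that assumption explicitly; it is adopted here only to show where the round numbers come from. Under it, the share of scores within one sigma of the mean, and the share lying more than one sigma above it, can be checked as in the sketch below (the function name is illustrative).

```python
import math


def normal_cdf(z):
    """Cumulative probability of a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)  # the "sigma-range"
above_one_sigma = 1.0 - normal_cdf(1.0)                # the upper tail

print(f"{within_one_sigma:.3f} {above_one_sigma:.3f}")  # 0.683 0.159
```

So whenever one class mean lies at or beyond the upper end of another class's sigma-range, roughly one student in six of the lower class may be expected to exceed it.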

“School overlapping” distinguished from “Regents overlapping.” — These 
overlappings, however, while they fairly represent the heterogeneity of 
classes under which both students and teachers labored during the session 
1924-1925, are not to be interpreted as the results solely of the Regents 
examinations. The sigma-ranges in Chart 16 are for all the students in 
each class who took the Regents examinations in June, 1925. We may 
charge to the Regents examinations only such overlapping as remains 
after these students have been reclassified by the old-type parts of the 
Regents examinations of June, 1925. We shall present data on such 
residual overlapping, which is hereafter called “Regents overlapping,” 
after we have analyzed in more detail the “school overlapping” of Chart 16. 
For this purpose we shall present the distributions of Tables 28-31, in- 
clusive, in the form of percentile graphs. 

“School overlapping” displayed by percentile graphs. — The simplest 
and most effective description of the percentile curve and its uses has been 
prepared by Otis.¹ “A percentile curve is a smooth line having a horizontal 
length representing 100 per cent of the scores of any group of individuals 
and so drawn that any point on the curve has a height representing 
the amount of a given score and a horizontal position on the graph 
representing the per cent of the scores of the group that is exceeded by 
the given score. A percentile curve shows at a glance not only the median 
score of a class but also the range and variability of the scores. 
It shows at a glance just what per cent of the scores of the class is exceeded 
by the score of any given individual and just what per cent of the 
class attains or exceeds any given score. Two or more curves on the same 
graph show very vividly the amount of overlapping of the scores of different 
classes.” 


¹ Otis, A. S., Statistical Method in Educational Measurement, World Book Company, Yonkers, N. Y., 
1925, pp. 58-67. This book cannot be too highly recommended to those who wish to become acquainted 
with the fundamentals of statistical methods in education. Cf. pp. 19-27, this volume, for confirmatory 
evidence on overlapping of classes in junior high schools. 
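
Otis's definition can be turned into a short construction. The sketch below is an illustration only: it assumes grouped frequencies in intervals 10 points wide and pairs each interval's upper limit with the per cent of the group scoring below it. The data are the German IV total-population frequencies as read from Table 30, and the function name is hypothetical.

```python
# German IV, total population (Table 30): interval lower limit -> frequency.
GERMAN_IV = {140: 2, 150: 3, 160: 1, 170: 4, 180: 6, 190: 11, 200: 10,
             210: 28, 220: 14, 230: 24, 240: 15, 250: 8, 260: 1}
WIDTH = 10  # assumed interval width


def percentile_curve(dist, width):
    """Return (per cent of scores below, score) points, from left to right."""
    n = sum(dist.values())
    points = [(0.0, min(dist))]          # the curve starts at the lowest limit
    below = 0
    for lower in sorted(dist):
        below += dist[lower]
        points.append((100.0 * below / n, lower + width))
    return points


curve = percentile_curve(GERMAN_IV, WIDTH)
for pct, score in curve:
    print(f"{pct:6.1f}%  {score}")
```

Joining these points with a smooth line gives the percentile curve. Reading across at 50 per cent places the German IV median between 210 and 220, in keeping with the median of 219 given in Table 30.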

Charts 17, 18, and 19 are percentile graphs for all the students that took 
the Regents examinations in French, Spanish, and German in June, 1925, 
showing percentile curves based on the new-type scores of each class in 
each language. As a guide to the reader unfamiliar with such graphs we 
shall “read” in detail Chart 17, showing percentile curves of the new-type 
scores of second-, third-, and fourth-year French classes. 

How to read a percentile graph. — The lowest percentile curve in Chart 17 
is for French II, and the upper for French IV students. The first thing 
to notice, and keep in mind, is that the point at which a curve crosses 
the fifty-percentile line, i.e., the heavy vertical line headed “50,” represents 
the median score of the class; the value of this median score can be ascertained 
from the first column of figures at the left, headed “New-type 
scores.” The French II curve crosses the fifty-percentile line just above 
the horizontal line corresponding to a score of 150; since the distance 
between each pair of horizontal lines represents 10 score-points, and the 
French II curve crosses at a point about two-tenths of the distance between 
the 150- and 160-score horizontal lines, we estimate that the median of the 
French II class is about 152 new-type score-points. Similarly we may 
estimate the French III median to be about 184, and the French IV me- 
dian to be about 208 score-points. (Since the distributions are slightly 
skewed, the medians differ slightly from the mean scores indicated in 
Chart 16.) 

One-third of students in French classes misplaced by one year or more. — 
Applying a straight-edge, which is kept constantly parallel to the hori- 
zontal lines in the chart, to the point at which the French II curve crosses 
the fiftieth-percentile line, we may ascertain the per cents of French III 
and French IV students who receive scores below the French II average 
by noting the points at which the straight-edge cuts the French III and 
IV curves, and reading the per cents vertically above these points. The 
straight-edge when so applied cuts the French III curve just to the right 
of the 10th-percentile (vertical) line, and cuts the French IV curve just 
to the left of the 5th-percentile line; we thus perceive that about 12% 
of the French III and about 4% of the French IV students are below the 
French II average. It appears, therefore, that of the 489 students who 
had studied French in New York State high schools five periods per week 
for four years, about twenty are so far from being fourth-year students 
in actual achievement that they would be scarcely respectable, academically, 
as members of the second-year class! How these students were 
ever admitted to the fourth-year class is a mystery which we shall endeavor 
to explain in a later section of this report. 


¹ See above, pp. 20 to 27. 

Applying the straight-edge to the point at which the French III curve 
crosses the fiftieth-percentile (vertical) line, and keeping it constantly 
parallel to the horizontal lines of the chart, we observe that it cuts the 
French II curve on the eighty-fifth-percentile line; subtracting 85 from 
100, we learn that about 15% of the French II students are above the 
average of the French III class. Holding the straight-edge in this posi- 
tion, we observe (at the left) that it cuts the French IV curve just to the 
right of the twentieth-percentile line; whence we estimate that about 
22% of the French IV students are below the French III average in actual 
achievement in French. 

Keeping the straight-edge constantly parallel to the horizontal lines 
of the chart and sliding it up to the point at which the French IV curve 
crosses the fiftieth-percentile line, we note that the straight-edge cuts 
the French III curve just to the right of the eightieth-percentile line, and 
cuts the French II curve just to the right of the ninety-fifth-percentile 
line. Estimating the exact per cents, and subtracting from 100, it ap- 
pears that about 17% of the French III and about 3% of the French II 
students are above the fourth-year average. We have no means of directly 
measuring the misplacement of students at the lower end of the 
French II and at the upper end of the French IV curves; but since about 
22% of the French IV students are below, and about 15% of the French 
II students are above the French III median, and since only about 70% 
of the French III students are between the French II and French IV 
medians, it seems fair to say that at least 30% of all French students are 
misplaced in one direction or the other by one year or more, and about 
3 or 4% by two years or more. 
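
The straight-edge readings amount to asking what per cent of one class falls below, or above, a score fixed by another class. A minimal sketch of that computation follows, assuming linear interpolation within 10-point intervals. Because the French distributions are tabulated elsewhere in the report, the sketch uses the German total-population frequencies as read from Table 30, together with the class means printed there; the function name is illustrative.

```python
# Total-population frequencies from Table 30 (interval lower limit -> count).
GERMAN_II = dict(zip(range(50, 270, 10),
                     [4, 6, 10, 18, 36, 42, 47, 77, 69, 91, 127,
                      137, 126, 154, 151, 141, 93, 75, 47, 23, 6, 2]))
GERMAN_III = dict(zip(range(100, 280, 10),
                      [6, 7, 6, 7, 13, 37, 33, 54, 55, 79,
                       74, 72, 62, 67, 52, 35, 7, 1]))
WIDTH = 10  # assumed interval width


def percent_below(dist, cut, width):
    """Per cent of a grouped distribution lying below the score `cut`."""
    n = sum(dist.values())
    count = 0.0
    for lower, freq in dist.items():
        if cut >= lower + width:
            count += freq                           # whole interval is below
        elif cut > lower:
            count += freq * (cut - lower) / width   # part of the interval
    return 100.0 * count / n


MEAN_II, MEAN_III = 172.1, 202.9  # class means printed in Table 30

iii_below_ii_mean = percent_below(GERMAN_III, MEAN_II, WIDTH)
ii_above_iii_mean = 100.0 - percent_below(GERMAN_II, MEAN_III, WIDTH)
print(round(iii_below_ii_mean, 1), round(ii_above_iii_mean, 1))
```

The two results come to about 18 and 23 per cent, which agree with the German entries of Table 36 for third-year students below the second-year mean and second-year students above the third-year mean.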

Sixty per cent misplaced by as much as one semester. — What proportion 
of the students are misplaced by as much as one semester? We may 
answer this question by applying the straight-edge to the point midway 
between the points at which the French II and III curves cross the fiftieth- 
percentile line, and to the point midway between the French III and IV 
median points. The median for French II is 152 and for French III 184; 
the point midway between these two is 168. Applying the straight-edge 
to this point while keeping it parallel to the horizontal lines, we learn 
by reading the per cents vertically above the points at which it cuts the 
French IV, III and II curves, that about 11% of the French IV students 
are nearer the French II median than to a higher class median; that 
about 25% of the French III students are nearer to the French II median 
than to a higher median; and that about 32% of French II students are 
nearer to a higher median than to their own class median. 

Sliding the straight-edge up to the point midway between the median 
points of French III and IV students, and reading from left to right above 
the points at which it cuts the three curves, we learn that about 33% 
of French IV students are nearer to a lower class median than to their 
own, that about 35% of French III, and about 9% of French II students 
are nearer to the French IV median than to a lower class median. Of 
the French III students, therefore, about 65% are nearer to a lower or 
higher class median than to their own; less than 40% are closer to the 
French III median than to some other class median! From these facts 
it seems safe to conclude that more than 60% of all French students in 
New York State secondary schools are misplaced by one semester or more. 
This condition is very near to chaos, so far as classification for instruc- 
tional purposes and so far as educational guidance are concerned. It is 
an eloquent portrayal of the crucial need for a centrally administered 
system of examinations capable of affording accurate measures of defined 
achievement expressed in terms of comparable units and reckoned from 
stable and uniform standards. The overlapping here displayed is what 
we have called “school overlapping”; obviously, any examination system 
which depends on local ratings from schools must accept such overlapping 
of classes and misplacement of students to the full extent of such dependence. 
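
The semester test used above is a nearest-median rule: a student counts as misplaced when his score lies closer to another class's median than to that of his own class. The sketch below applies the rule to grouped data by cutting at the midpoints between adjacent medians. It is an illustration under assumptions, not the report's own procedure: it interpolates linearly within 10-point intervals, recomputes the medians from the frequencies, and uses the German total-population distributions as read from Table 30, since the French frequencies are tabulated elsewhere. All names are hypothetical.

```python
WIDTH = 10  # assumed interval width

# Total-population frequencies from Table 30:
# (lower limit of the first occupied interval, counts in ascending order).
CLASSES = {
    "II": (50, [4, 6, 10, 18, 36, 42, 47, 77, 69, 91, 127,
                137, 126, 154, 151, 141, 93, 75, 47, 23, 6, 2]),
    "III": (100, [6, 7, 6, 7, 13, 37, 33, 54, 55, 79,
                  74, 72, 62, 67, 52, 35, 7, 1]),
    "IV": (140, [2, 3, 1, 4, 6, 11, 10, 28, 14, 24, 15, 8, 1]),
}


def count_below(start, freqs, cut):
    """Number of scores below `cut`, interpolating inside an interval."""
    total = 0.0
    for i, f in enumerate(freqs):
        lower = start + i * WIDTH
        share = min(max((cut - lower) / WIDTH, 0.0), 1.0)
        total += f * share
    return total


def median(start, freqs):
    """Interpolated median of a grouped distribution."""
    half = sum(freqs) / 2
    below = 0
    for i, f in enumerate(freqs):
        if f > 0 and below + f >= half:
            return start + i * WIDTH + (half - below) / f * WIDTH
        below += f
    raise ValueError("empty distribution")


medians = {name: median(*data) for name, data in CLASSES.items()}
cut_low = (medians["II"] + medians["III"]) / 2    # halfway to the class below
cut_high = (medians["III"] + medians["IV"]) / 2   # halfway to the class above

start, freqs = CLASSES["III"]
n = sum(freqs)
nearer_lower = 100 * count_below(start, freqs, cut_low) / n
nearer_higher = 100 * (n - count_below(start, freqs, cut_high)) / n
nearer_own = 100 - nearer_lower - nearer_higher
print(round(nearer_lower), round(nearer_own), round(nearer_higher))
```

On these assumptions only about a quarter of the German III students lie nearer to their own class median than to that of a neighboring class, which is of the same order as the French result described above.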

Charts 18 and 19, showing percentile graphs for Spanish and German 
classes, are in every respect similar to Chart 17, and may be read in the 
same way. They are, therefore, presented without comment. 

Regents overlapping. — In order to measure the overlapping caused, or 
allowed, by the Regents examinations we must consider only those 
students who, according to old-type Regents examinations of June, 1925, 
belong in each of the three classes; that is, those students in each class 
who are passed by the Regents marks or by marks which the State Depart- 
ment readers have accepted. Charts 20, 21, and 22 show percentile 
curves for each of the three classes in each modern language. On account 
of space limitations we shall describe only the percentile graph for the 
French classes, leaving it to the reader to see how closely the indications 
of Charts 21 and 22 parallel those of Chart 20. 

Percentile graphs for students passed by Regents review ratings. — The 
points at which the three curves cross the fiftieth-percentile line in Chart 20 
indicate that the median scores of the passed students of the three French 
classes are about 160, 189, and 213 in order. Applying a straight-edge 
to the point at which the French II curve crosses the fiftieth-percentile 
line, keeping it parallel to the horizontal lines of the chart, and noting 
the points at which it cuts the other two curves at the left, we learn that 
(in terms of new-type scores) about 10% of the students whom the Regents 
examinations classify as having passed French III are below the 
average of those classified by the old-type Regents examinations as having 
passed French II, and that about 2% of the students passed in French IV, 
and thus declared by the Regents examinations to be prepared to study 
French literature in college, are below the average of students passed in 
French II! It is thus manifest that while the Regents examinations do 
somewhat reduce the school overlapping, they fall far short of classifying 
and promoting students according to actual achievements. The Regents 
examinations promote to college classes in French literature some stu- 
dents who are unable to recognize the meanings of French words, to read 
simple French sentences, or to complete very simple French phrases, as 
well as these things are done by the average students who, according to 
the Regents examinations, have passed French II. One in ten of the 
students promoted by the old-type Regents examinations to French IV 
is below the average of those promoted only to French III. This failure 
of the Regents examinations to do the very thing for which they were 
primarily instituted, and which is so essential to the educational progress 
of the state, is inherent in the form of examinations used, and is eloquent 
of the need for fundamental changes in the Regents examination system. 

Applying the straight-edge to the point at which the French III curve 
crosses the fiftieth-percentile line and reading the per cents vertically 
above the points at which it cuts the French IV curve at the left, and the 
French II curve at the right we find that about 18% of students passed 
in French IV are below, and about 16% of the students passed in French II 
are above the median of students passed in French III. Sliding the 
straight-edge up to the French IV median point and reading vertically 
above the points at which it cuts the French II and III curves, it appears 
that about 14% of French III passed students and about 2% of French II 
passed students are above the average of students passed by the old-type 
part of the June, 1925, Regents examination in French IV. 

Applying the straight-edge to the point on the fiftieth-percentile line 
midway between the French II and French III medians, keeping it parallel 
to the horizontal lines, and reading above the points where it cuts the 
French III and French IV curves, we learn that about 8% of the passed 
French IV students are nearer the French II median than to a higher 
median, that is, are misplaced by one year or more; and that 
about one-fourth of passed French III students are nearer to the French 
II median than to their own class median. Holding the straight-edge in 
this position and reading at the right, we learn that about 33% of passed 
French II students are nearer to the average of passed French III or 
French IV students than to their own class average. 

Sliding the straight-edge up to the point midway between the French 
III and IV median points and reading at the left we learn that about 
30% of French IV students passed by the old-type Regents examination 
are nearer to a lower class median than to their own; and, reading at the 
right, we learn that about 32% of passed French III students and about 
8% of passed French II students are nearer to the French IV median 
than to any other class median. 

Regents overlapping nearly equal to school overlapping. — Summarizing, 
it appears that by the old-type Regents examinations in French about 
five or six per cent of all French students are misplaced by two years or 
more, about 33% by one year or more, and more than 60% by one semester 
or more. Less than 40% of the students as classified by the Regents 
old-type examinations in French are nearer to their own class median 
than to some other class median. On the whole, this is almost as bad as 
the classification based on local school ratings of the old-type Regents 
examinations. 

That such failure to classify reasonably well is thoroughly character- 
istic of the Regents examination system as it operates with old-type exami- 
nations is indicated by Charts 21 and 22, which show similar overlappings 
for Spanish and German classes. These charts lend almost incontrover- 
tible support to the analysis of old-type examinations offered above on 
pages 133 to 136. 

Percentile graphs for students failed by Regents review ratings. — One 
would naturally expect that most of the students failed by the old-type 
Regents examinations would be at the lower end of the distribution of 
new-type scores in each class. The amount of “Regents overlapping” 
just described shows that, although a sufficient number was failed in 
each class to reduce materially the overlapping, the old-type Regents 
examinations actually did not distribute the failures in such a way as to 
improve very much on the poor classification effected by the local school 
ratings. Charts 23, 24, and 25 show strikingly the indiscriminate char- 
acter of the old-type failures. 

A full appreciation of Chart 23 requires that it be compared in detail 
with Charts 17 and 20; but in itself it presents a striking picture. It is 
clear from Chart 23 that many students who are failed in French are 
above the average of the class in which they were failed and in a few cases 
are even above the average of the class above that in which they were failed. 
For example, if the reader will recall that the median of all French IV 
students was 208, Chart 23 shows that a little over 1% of the students 
failed in French III are above the average score of all French IV students. 
Even more striking is the fact that a few students were failed in 
French II who were above the French IV average. The educational 
significance of such errors in grading can hardly be overestimated. Charts 
24 and 25 are similar to Chart 23 and indicate that the differences between 
the three languages are so slight as to be negligible. The educational 
welfare of the students in all three languages demands that the Regents 
examinations be materially strengthened. Table 36 is presented as a 
compact numerical summary of the main aspects of the overlapping set 
forth in Charts 17-25, inclusive. 


TABLE 36 


OVERLAPPING OF MODERN LANGUAGE CLASSES, SHOWING THE PER CENT OF EACH 
CLASS THAT EXCEEDS THE NEXT HIGHER CLASS AVERAGE, AND THE PER CENT THAT FALLS 
SHORT OF THE NEXT LOWER CLASS AVERAGE ACCORDING TO NEW-TYPE SCORES. THE 
PER CENTS IN ODD-NUMBERED COLUMNS APPLY ONLY TO STUDENTS PASSED BY THE 
DEPARTMENT; THE OTHER FIGURES RELATE TO ALL STUDENTS THAT TOOK THE NEW-TYPE 
TESTS IN EACH CLASS IN JUNE, 1925. THE CLASS MEANS USED IN EVEN-NUMBERED 
COLUMNS ARE OF THE TOTAL NUMBERS OF STUDENTS IN EACH CLASS, AND IN ODD-NUMBERED 
COLUMNS OF STUDENTS PASSED BY THE DEPARTMENT REVIEWERS. 


Per Cents of Students in Classes Indicated in Column (1) Who Are 


                 Above 3rd-         Above 4th-         Below 2nd-         Below 3rd- 
                 Year Mean          Year Mean          Year Mean          Year Mean 
                Total   Passed     Total   Passed     Total   Passed     Total   Passed 
    (1)           (2)      (3)       (4)      (5)       (6)      (7)       (8)      (9) 
French   II      15.0     15.0       4.0      2.0 
         III                        22.0     14.8      10.0      9.7 
         IV                                             3.0      2.0      22.0       eo 
Spanish  II      18.0     16.4       2.0 
         III                        10.0     14.0      12.0     13.0 
         IV                                             1.0               10.0     11.0 
German   II      23.0     18.4       6.0      7.0 
         III                        45.0     30.0      18.0     18.0 
         IV                                             6.0      2.0      24.0     23.6 


Percentile graph for new-type Physics test scores of total, passed and failed 
groups. — Chart 26 shows percentile curves for all the students who took 
the Regents examinations in physics in June, 1925; for all students who 
were passed on the old-type part of the Regents examination in physics, 
and for all students who were failed by the old-type part of the Regents 
examination. The inaccuracy of the grades based on the old-type part 
of the Regents examination is clearly indicated. Many students who are 
far above the average are failed by the old-type, while many students 
who are obviously failures are given credit by the old-type examination. 

Overlapping of classes displayed by frequency polygons. — Thus far 
we have attempted to show the character and extent of overlapping of 
classes due to the inaccuracies of the old-type parts of the Regents exam- 
inations by means of percentile curves. While the percentile curve is 
peculiarly adapted to display overlapping, it is by no means the only 
device by which extensive overlapping may be exhibited. This is espe- 
cially true for readers who are not used to reading percentile curves. 
For the sake of all readers of this report, but especially for those who 
may be unfamiliar with the percentile graph, Charts 27 to 29 inclusive 
are presented. 

The frequency polygons of Chart 27 display in a most striking manner 
the overlapping of classes before and after the application of the old-type 
parts of the Regents examination in French. The line of figures at the 
bottom of the chart reading from left to right refers to scores on the 
new-type test in French. The figures reading up at the left represent the 
numbers of students. The curves at the bottom of Chart 27 are for all 
students in the second-, third-, and fourth-year classes, respectively, who 
took the Regents examinations in June, 1925. The solid line curve 
shows that the scores of the second-year students range from 10 to 240, 
those of third-year students from 20 to 250, and those of fourth-year stu- 
dents from 90 to 250. The vertical lines show the means or averages of 
the three classes. 

The curves at the top refer only to students in these three classes who 
passed the old-type part of the Regents examinations in French in June, 
1925. The very slight effect which the old-type Regents examination 
had on the forms of the total distributions and on the average scores of 
the three classes is self-evident in Chart 27. The spread of the scores 
in the three classes is scarcely reduced at all and the overlapping of classes, 
as has already been set forth in connection with Charts 20, 21, and 22, is 
only very slightly reduced. 

Charts 28 and 29 are very similar to Chart 27 and should be read in 
the same manner. 

In order to make the facts of overlapping of classes as clear as possible 
and to show the evidence on the effectiveness or ineffectiveness of the old- 
type Regents examinations in as many lights as possible, we present the 
facts of the preceding charts in another series of frequency polygons in 
Charts 30 to 33. 

Chart 30. — Graphs of distributions of new-type scores of total population, passed 
and failed groups of each French class. The passed and failed groups consist of students 
passed or failed by the old-type parts of the Regents examinations. Scores are represented 
on the abscissa. 


The advantage of Chart 30 is that corresponding groups in the three 
classes may be directly compared. The three frequency polygons at the 
top of Chart 30 refer to total, passed and failed groups of French II students. 
The curves at the bottom of Chart 30 refer to total, passed and 
failed groups of French IV students. All nine frequency polygons of 
Chart 30 are set up over the same scale of scores, as indicated under the 
base lines of the polygons. By placing a straight-edge vertically on the 
chart and sliding it from left to right the reader may make the same 
comparisons as were made by means of the percentile curves described 
above. For example, it is obvious from Chart 30 that at least a few 
students who were failed in second-year French received higher scores on 
the objective test than half of the students who were passed in fourth-year 
French received. It is also clear from Chart 30 that some students who 
were passed in fourth-year French received scores lower than the average 
score received by students who were failed in second-year French. This 
is evidence of the signal failure of the Regents examination system which 
cannot by any reasonable means be controverted. Even though the 
reader may still be doubtful of the validity of the new-type examinations, 
the differences in standards set forth by Chart 30 are so great that they 
cannot be explained away by adverting to the possible lack of validity 
in the new-type scores. In other words, the force of the evidence here
presented does not depend upon an acceptance of the new-type scores as 
completely valid measures of modern language achievement. This fact 
is mentioned at this time merely for the sake of clearness, because it is 
scarcely conceivable that any one who has examined all the data pre- 
sented above can still entertain any reasonable doubts about the validity 
of the new-type examinations. If any such doubts do linger in the minds 
of some readers, it is confidently believed that they will be dispelled before 
the end of this report is reached. 
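
The straight-edge comparison described above can also be made numerically.
The following is a minimal sketch in Python, not a procedure used in the
report: the score lists are invented for illustration, and the function name
`share_above_median` is hypothetical.

```python
from statistics import median


def share_above_median(group, reference):
    """Return the fraction of `group` scoring strictly above the median of `reference`."""
    cutoff = median(reference)
    return sum(score > cutoff for score in group) / len(group)


# Invented new-type scores, for illustration only (not data from the report).
failed_second_year = [60, 95, 120, 150, 190, 210]
passed_fourth_year = [140, 170, 185, 200, 215, 230]

# Share of failed second-year students who outscore half of the passed
# fourth-year students.
overlap = share_above_median(failed_second_year, passed_fourth_year)
```

Any share greater than zero corresponds to the situation read from Chart 30:
at least a few students failed in the lower class stand above half of the
students passed in the higher class.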

Charts 31 and 32 are similar to Chart 30 and should be read in the 
same manner. Chart 33 shows frequency polygons for total, passed and 
failed groups of students who took the Regents examination in physics 
in June, 1925. The flattest curve of the three is a graph of the distribu- 
tion of new-type scores of students failed by the old-type part of the 
Regents examination. It shows that some students who were in the 
highest 10% of the whole class were failed by the old-type part of the 
Regents examination. The middle curve shows that many students were 
passed by the old-type who according to the new-type scores are in the 
lowest 10% of the class in terms of real achievement. Here again the 
force of the evidence against the old-type physics examination does not 
depend on the acceptance of the new-type examination as completely 
valid. That all the new-type tests referred to in this report are valid 
to a certain extent cannot be doubted by any reasonable person who ac- 
cepts the validity of the old-type, since the new-type examinations in 
every case correlate with the old-type as well as, or better than, the old- 
type results correlate with themselves. When all the evidence, both 
experimental and a priori, is considered, it seems clear that the new-type 
tests are as much more valid, as they are more reliable, than the old-type; 
but for the sake of clearness it should be said that the evidence of Charts
17 to 33 inclusive would be convincing as to the inaccuracy of the old-type
examinations even if we knew that the new-type examinations were in
each case barely equal in validity and reliability to the old-type
examinations.

Chart 31. — Graphs of distributions of new-type scores of total population, passed
and failed groups of each Spanish class. The passed and failed groups consist of students
passed or failed by the old-type parts of the Regents examinations. Scores are
represented on the abscissa.

Chart 32. — Graphs of distributions of new-type scores of total population, passed
and failed groups of each German class. The passed and failed groups consist of students
passed or failed by the old-type parts of the Regents examinations. Scores are
represented on the abscissa.

Chart 33. — Graphs of distributions of new-type Physics test scores, of all students
taking the Regents physics examination in June, 1925, of students passed and of students
failed on the old-type part of the examination.

Forms of distributions are evidence against old-type examinations. — In
the preceding charts the reader is able to see that the distributions of
new-type scores conform closely to the normal curve. It is generally
held by those entitled to an opinion that the educational achievements
of students are distributed approximately according to the probability or
normal curve. We know from actual experience that most of the physi- 
cal measurements of school children, such as of height, weight, sitting- 
height, length of fore-arm, etc., are distributed approximately in the form 
of the normal probability curve. Accordingly, the form of distribution 
of scores given by an examining device can be used as evidence of the 
general validity of the examination. We know, for example, that it is 
just as contrary to reason to expect that students are divided into two 
discrete groups with reference to achievement in French, as it is to expect 
that they are divided into two discrete groups with reference to height. 
Any yard-stick which would give a U-shaped distribution of the heights 
of high school juniors would certainly be suspected immediately by all
competent teachers.
To divide high school juniors into discrete passed and failed groups is
in general terms about as useful to the teachers of the state as dividing
the same students into discrete tall and short groups would be to the
tailors of the state. The proportion of maladjustments would necessarily
be about as great in one case as the other if the true distributions in the
two cases are approximately of the same form. Chart 34 shows a very
pronounced tendency towards U-shaped curves for all of the old-type
examinations in the three modern languages and in physics. Any
new-type examination which gave such distributions would immediately be
rejected. The shapes of the curves given by the old-type examination
grades may be explained in various ways, but such explanations would
appear to be only added evidence against the old-type form of examination
as the sole criterion of achievement in modern languages and in physics.
It is easy to see, for example, that the sharp dip in the curve between
percentage grades 70 and 75 is an artificial depreciation which is resorted
to on the frank assumption that college admission officers do not like to
make close border-line decisions.

Chart 34. — Graphs of distributions of old-type examination grades by classes.
These are graphs of “Regents grades” or of grades from the schools which have been
accepted by the Regents reviewers. Percentage grades are represented on the abscissa.

Regents examinations should be strengthened, not abandoned or cur- 
tailed because of their weaknesses. — Needless to say, the evidence here 
adduced from the forms of the new- and old-type distributions is not 
sufficient alone to be very significant one way or another, but in con- 
junction with all the other evidence presented in the report, it goes to 
show that the Regents examinations need to be strengthened in order 
that they may more adequately meet the crucial need for which the 
Regents examinations were originally instituted. The proposal made by 
some critics, that the Regents examinations, in view of their weaknesses, 
ought to be abandoned is given no support whatever by the data of this 
report. On the contrary, our data show clearly that abandonment or 
weakening of the Regents or College Entrance Examination Board exami- 
nation systems would be a retrogression toward complete chaos. The 
variations and inaccuracies of local school ratings would almost certainly 
be more anarchical and meaningless than at present if the moral influ- 
ence of such examining agencies on standards were withdrawn. What we 
need is to make this influence more accurate and just in its actual operation
than it is with the present subjective examinations used by these
agencies.


VARIABLE STANDARDS OF SCHOOLS NOT EQUALIZED BY
REGENTS EXAMINATIONS


The standards in individual schools and in groups of schools are, as 
has already been shown, quite variable. One of the purposes of the 
Regents examinations is to maintain high and uniform standards of 
achievement in all subject matters and in all schools in the state. Our 
data show that the Regents examinations fail to achieve this desirable
purpose in all of the four subject matters and in each of the modern
language classes with which this experiment is concerned. Space
limitations do not permit us to display all the instances in which this failure
to maintain standards is apparent. We shall therefore take only a meagre
sampling and confine our illustrations to second-year French and to
physics examination results for several large individual schools and classes
of schools.

Chart 35. — Variability of Regents standards in French II. The horizontal line
from which the vertical bars originate represents the superimposed state-wide means,
or averages, of old- and new-type scores. Each pair of bars refers to the students from
one school, or a group of schools. (See Table 21, above.) The black bar shows the
average new-type score expressed in terms of standard deviation of new-type scores and
the hatched bar shows the old-type average expressed in terms of the standard
deviation of old-type scores of all French II students. All the bars in the chart are
therefore directly comparable with one another. The number of cases is indicated at the
end of each pair of bars.

Variable standards displayed by comparing new- and old-type average
scores of French II students in individual schools and groups of schools. — 
Chart 35 shows for fifteen schools or groups of schools the average scores 
achieved by students on the new-type French examination and on the 
French II old-type Regents examination expressed in terms of deviations
from the state-wide French II average. The horizontal line from which 
all the bars in Chart 35 originate represents the state-wide average on 
both new- and old-type French examinations. The length of each black 
bar shows the average new-type score of a school or group of schools, ex- 
pressed in terms of the state-wide standard deviation of new-type scores, 
and the length of the adjacent cross-hatched bar shows the average score of 
the same school or group on the old-type French examination, expressed 
in terms of the state-wide standard deviation of the old-type scores for 
French II. Therefore all the bars in Chart 35 are directly comparable. 
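
The conversion that makes the two kinds of bars comparable can be sketched
as follows. This is an illustrative example only: the function name
`in_sd_units` is hypothetical, and every number in it is invented rather than
taken from Table 21 or Chart 35.

```python
def in_sd_units(school_mean, state_mean, state_sd):
    """Express a school average as a deviation from the state-wide average,
    measured in state-wide standard deviations."""
    return (school_mean - state_mean) / state_sd


# Invented figures, for illustration only (not data from the report).
# New-type scale: state-wide mean 150, standard deviation 40.
new_type_bar = in_sd_units(school_mean=130.0, state_mean=150.0, state_sd=40.0)
# Old-type percentage scale: state-wide mean 70, standard deviation 10.
old_type_bar = in_sd_units(school_mean=74.0, state_mean=70.0, state_sd=10.0)

# The two examinations disagree when the bars fall on opposite sides of the
# state-wide average line.
examinations_disagree = (new_type_bar < 0) != (old_type_bar < 0)
```

Because each average is divided by the standard deviation of its own scale,
a raw new-type score and a percentage grade can be set side by side on one
chart, which is the sense in which the bars are directly comparable.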

In general there is considerable agreement between the black and 
cross-hatched bars, but there are several disagreements. Thus school 
591, with 59 students in French II, is below average according to the 
new-type but is somewhat above average according to the old-type 
examination. Schools 696 and 862 have their positions directly reversed 
by the new- and by the old-type results. The most striking difference 
between old- and new-type results shown in Chart 35 is that of school 
369, with fifty-eight students. According to new-type results the French 
II students in this school average more than two standard deviations 
below the state-wide average. According to old-type scores they average 
less than one standard deviation below the state-wide average. All these 
disagreements are almost certainly due to the subjectivity and lack of 
comparability of the old-type examinations, aggravated by the sampling 
method of reading to which the Department reviewers in Albany are 
restricted. 

Variable standards displayed by physics test results. — The differences 
displayed in Chart 36 for new- and old-type physics examination results 
are even greater than those set forth in Chart 35. Schools 821 and 727 
are below average according to new-type and above average according 
to old-type. School 918 is above average according to new-type and 
below average according to old-type. School 863 according to new-type 
is barely above average and according to old-type is almost one-half
standard deviation above average; while 819 according to new-type is
far above average, but according to old-type is a little below average. 
School 884 is in much the same situation as school 819. These two 
schools read their physics papers with great severity, and, because of the 
sampling method of reviewing, their severe reading was accepted by the 
State Department without mitigation. Schools 821, 727 and 863 read 
their papers much more leniently and their leniency was not stiffened 
by the Department reviewers. 

In both these charts it is evident that the large schools in general aver- 
age above the state-wide average, and the smaller schools and academies 
below average. The lowest school in both French II and physics is No. 
369, a private school, while one of the highest schools in both charts is 
No. 862, which is a large high school in New York City. 

Variable standards displayed by comparison of new-type average scores 
of second- and third-year French classes in individual schools and groups 
of schools. — The variability of the Regents standards is further dis- 
played in Charts 37, 38, and 39, which show for individual schools and 
groups of schools the new-type average scores of second- and third-year 
French students. Chart 37 shows the averages of all the students in 
these schools that took the Regents French examination in June, 1925. 
The horizontal lines from which the bars originate represent the second- 
and third-year state-wide averages. The third horizontal line at the top 
represents the fourth-year French state-wide average. The length of each 
bar above or below one of the average lines indicates the average new- 
type score of all the students in the school or group of schools described 
at the bottom of the chart. The bars are not in terms of standard devia- 
tion but represent raw new-type scores. The presence of the three aver- 
age lines, however, enables the reader to see at once the extent of the 
differences between the achievements of students in the same year-class, 
but in different schools. The figures at the end of each bar show the 
number of students whose average new-type score the bar represents. 

For example, the bar in the lower left-hand corner of the chart shows 
that the 358 French II students from the senior schools of the state had 
an average new-type score of approximately 139, which is eleven points 
below the French II state-wide average; the long bar rising from the second- 
year average line at the extreme right end of the chart shows that the 
241 French II students in school 971 had an average new-type score of 
about 181, which is 31 points above the state-wide French II average, 
and only two points below the state-wide French III average. The 
relative positions of the group of senior schools and of school 971 are
practically the same with regard also to French III students. 

Chart 38 shows to what extent the Regents examinations took account 
of the differences displayed in Chart 37. This chart shows the new-type
average scores of students who were passed by the old-type Regents
examinations in French II and III. Without attempting to notice all 
of the interesting features of Chart 38, it is sufficient to observe that the 
average of the French II students in school 971 who were passed by the 
old-type Regents examination in French II is actually above the state- 
wide average for French III and is ten points higher than the average 
new-type score of the students from the senior schools who were passed 
in French III. The students passed in French II from school 971 are
more than a year above the average of the students passed in French II
from school 591. The average new-type scores of the French II passed
students from four schools are nearer to the state-wide French III average
than to the state-wide French II average. The averages of the
French III passed students from three schools are above the average of
the French III passed students from the group of senior schools. It
is apparent from Chart 38 that the old-type examinations have significantly
failed to describe the achievements of students from different
schools in terms of uniform standards. Not only this, but the old-type 
examination in French II has given only French II credit to a group of 
students who in reality are above the state-wide French III average. 
This is a notable example of the weakness of a system of separate examina- 
tions for each year-class. The fundamental weakness of such a system 
is that it assumes beforehand, and in contravention of all available evi- 
dence, that no student in a given class could possibly deserve to pass 
in any higher class or to fail in any lower class. Under the Regents rules 
the rigidity of the separate examination system is somewhat mitigated 
by special dispensations to a few students, on the recommendation of 
their teachers, whereby they are allowed to take, for example, the French 
III examinations when they have had only two years of French classwork.
Chart 38 shows clearly the inadequacy of this meagre and reluctant provi- 
sion for individual differences. 

Chart 39 is parallel to Charts 37 and 38 and shows the average new-type 
scores of French II and III students (from the same schools and groups 
of schools) who were failed by the old-type Regents examinations. The 
average new-type scores of students from senior schools and academies 
who were failed in French III are far below the state-wide average of 
French II students, while the students failed in French II from schools 
884, 696, and 971 secured average new-type scores almost equal to the 
new-type French III average. In other words, the students who were 
failed in French III from the senior schools and academies are more than 
a year below those failed from some other schools. The students from 
the senior schools and academies who were failed in French II are con- 
siderably more than a year below the average of the students failed in 
French II from school 971. The failing standard thus varies even more
than the passing standard. In school 971 eleven students were failed
in French III who are almost exactly at the French III state-wide average
and forty-six students were failed in French II who averaged five
points above the state-wide French II average. The averages of the
students failed in French II in schools 862, 806 and 971 are above the
averages of the students failed in French III from the senior schools and
academies.

Chart 40. — Showing for indicated schools the medians and interquartile ranges
of new-type scores in French II. The school identifications appear at the left, the
numbers of passed and of failed students are indicated just above the interquartile range
lines, and the numbers of both passed and failed students are indicated at the right.
The vertical dash-dot line indicates the state-wide average of all French II students,
both passed and failed.

The variability of standards displayed in these three charts is in reality 
greater than appears because in these charts we are dealing only with 
averages, and not with extreme individual variations. 

Looking again at Chart 37 one would expect that a much larger propor- 
tion of the students at the left of the chart would be failed than of the 
students from the schools at the right end of the chart. Comparing the 
numbers at the ends of the bars in Chart 37 with the numbers at the end 
of the corresponding bars in Chart 39, we may ascertain to what extent 
the Regents examinations fulfilled this expectation. About 40% and 32% 
of the French II students from the two lowest school groups in Chart 37 
were failed by the old-type part of the Regents examination; the corre- 
sponding proportions for the four schools in the middle of the chart ranged 
from 10 to 22%; but of the students from school 971, 19% were failed in 
French II. In other words, almost a fifth of the French II class in school 
971, which in reality was barely below the French III state-wide average, 
were failed by the old-type part of the Regents examination, while con- 
siderably less than a fifth of the French II students from schools 919, 337 
and 884 were failed. Such facts as these hardly need comment. 
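
The proportion quoted for school 971 can be checked from the counts given in
the text: 241 French II students in the school, of whom forty-six were failed.
The sketch below is only a restatement of that arithmetic; the function name
`percent_failed` is hypothetical.

```python
def percent_failed(failed, total):
    """Return the percentage of a class failed by the old-type examination."""
    return 100.0 * failed / total


# Counts for French II in school 971, as stated in the text above.
school_971 = percent_failed(failed=46, total=241)

# Rounded to the nearest whole per cent for comparison with the text.
school_971_rounded = round(school_971)
```

The result agrees with the figure of 19% given above, that is, almost a fifth
of the class.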

Variability of Regents standards only one aspect of fundamental inac- 
curacy of old-type subjective examinations. — The variability of the 
Regents standards is partly due to the fact that the State Department of 
Education accepts a part of the school ratings without review. That is, 
many school ratings are accepted by the Department readers on the basis 
of the review of a relatively small sample of school ratings. The stand- 
ards of different readers without central supervision naturally vary con- 
siderably, but that there is also a considerable element of inaccuracy 
behind the variations displayed in Charts 37 to 39 is indicated in Charts 
40 to 42, which show the interquartile ranges of the passed and failed stu- 
dents from several different schools. 

Chart 40 shows that the middle 50% of the students failed in school 954 
are entirely above the middle 50% of the students passed in school 931. 
More than a quarter of the students failed in school 954 are above the
average of those passed in that school by the old-type examination. In 
school 973 more than a quarter of the students failed achieved higher 
scores than the lower third of those that passed. Nearly three-quarters 
of the students failed in 954 are above the state-wide average and about 
a third of the students passed in school 931 are below the lower quartile 
of all French II students who took the Regents examinations in June, 1925. 
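
A comparison of interquartile ranges of the kind read from Chart 40 can be
sketched as follows. The scores are invented for illustration and do not
belong to schools 954 or 931; the function names are hypothetical, and the
quartiles are computed by the default method of Python's
`statistics.quantiles`, which may differ slightly from the method used in
preparing the charts.

```python
from statistics import quantiles


def interquartile_range(scores):
    """Return the lower and upper quartiles bounding the middle 50% of scores."""
    lower_quartile, _median, upper_quartile = quantiles(scores, n=4)
    return lower_quartile, upper_quartile


def entirely_above(upper_group, lower_group):
    """True when the middle 50% of `upper_group` lies wholly above the
    middle 50% of `lower_group`."""
    upper_group_q1, _ = interquartile_range(upper_group)
    _, lower_group_q3 = interquartile_range(lower_group)
    return upper_group_q1 > lower_group_q3


# Invented new-type scores, for illustration only (not data from the report).
failed_in_one_school = [150, 160, 165, 170, 175, 180, 190]
passed_in_another_school = [90, 100, 110, 120, 130, 140, 150]

reversal = entirely_above(failed_in_one_school, passed_in_another_school)
```

When `reversal` is true, the middle half of the students failed in one school
stands wholly above the middle half of the students passed in the other, which
is the condition described for schools 954 and 931.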

Charts 41 and 42 are so similar to Chart 40 both in form and indications
that they are presented without comment.

Chart 41. — Showing for indicated schools the medians and interquartile ranges
of new-type scores in French III. The school identifications appear at the left, the
numbers of passed and of failed students are indicated just above the interquartile
range lines, and the numbers of both passed and failed students are indicated at the
right. The vertical dash-dot line indicates the state-wide average of all French III
students, both passed and failed.

Chart 42. — Showing for indicated schools the medians and interquartile ranges
of new-type scores in Spanish II. The school identifications appear at the left, the
numbers of passed and of failed students are indicated just above the interquartile range
lines, and the numbers of both passed and failed students are indicated at the right.
The vertical dash-dot line indicates the state-wide average of all Spanish II students,
both passed and failed.

Chart 43. — Showing “unevenness” of achievement in modern language classes
with respect to vocabulary, silent reading, and grammar. Only second-year classes are
considered because of small numbers of students in third- and fourth-year classes. The
bars are in terms of state-wide standard deviations, and the horizontal lines from which
they originate represent superimposed state-wide means of the three parts of the
new-type examinations. Each triad of bars represents one school, and the bars in each
triad represent the parts of the new-type tests in order, from left to right, vocabulary,
silent reading, and grammar.

Unevenness of achievement in individual schools with respect to vocabulary,
reading and grammar. — From the correlations between the
Parts of the new-type tests presented in earlier sections of this report, it 
is clear that achievement of individual students is very uneven with 
respect to vocabulary, reading and grammar. That is, a student may 
rank very high with respect to grammar and very low with respect to 
vocabulary, or he may rank very high in reading ability and very low in 
grammar. It is of interest to know whether similar differences exist as 
between different schools. Is grammar, for example, emphasized rela- 
tively more in one school than in another? Are certain schools high in 
reading and low in grammar achievement? Chart 43 has been prepared 
with a view to answering such questions. 

The horizontal lines from which the vertical bars originate represent 
the superimposed state-wide means of the three Parts of the new-type 
examinations in French, Spanish and German. Each triplet of bars repre- 
sents one school and the bars in each triplet represent the Parts of the 
new-type tests in order. The length of the bar above or below the mean 
indicates the average of the school on the given part, expressed in terms 
of the state-wide standard deviation of scores on that part of the new- 
type test. All the bars originating from the same horizontal line are 
therefore directly comparable. While there are no very great differ- 
ences, there are several which seem significant. Compare, for example, 
schools 884 and 337 in French II. These two schools are almost at par 
with respect to vocabulary but school 884 is a little over three-tenths of 
a standard deviation below and school 337 is .15 of a standard deviation 
above the state-wide average in grammar. Only one school is equal in 
all three parts, that is school 696. Only two of the schools chosen for 
Chart 43 are higher in grammar than in any other part, schools 337 and 
862. 

The differences for Spanish II are somewhat greater than those for 
French II. School 804 is considerably better in grammar than in either 
vocabulary or reading; while school 917 is very poor in both grammar 
and vocabulary but less poor in reading. Schools 862 and 806 are at 
par with respect to reading but considerably above average in grammar 
and vocabulary. These two schools are about equal to 819 and 466 with 
respect to reading but very different with respect to grammar and vo- 
cabulary. 

With respect to German II, schools 884 and 962 show very different 
achievements in grammar. Both these schools are much nearer to the 
state-wide average with respect to vocabulary and reading than with 
respect to grammar. 

On the whole it appears that some schools do emphasize grammar more 
than others, while other schools emphasize either vocabulary or reading. 


V


INTERNAL STRUCTURE OF OLD- AND NEW-TYPE MODERN 
LANGUAGE EXAMINATIONS 


Thus far we have presented experimental evidence on the reliability, 
validity and comparability of the old- and new-type modern language 
examinations. The demonstrated advantages of the new-type over the 
old-type are in part due to the fact that only a sampling of the old-type 
papers are actually read by the central reviewing authorities, and to other 
unfortunate features of the Regents system that are not necessarily in- 
herent in the old-type subjective form of examination. We shall now 
attempt to show that, even if these unnecessary disadvantages under 
which the old-type operates in the Regents system were removed, the 
inherent and unavoidable limitations of the subjective form of examina- 
tion are so strict that it probably could never, under practical working 
conditions, equal the new-type in reliability, validity, comparability or 
administrative convenience, hour for hour of examination time, and dollar 
for dollar of cost. 

In order to justify this statement we shall consider in some detail the 
method of construction of the old- and the new-type tests used in this 
experiment, the size and character of the samplings of vocabulary elements, 
and the nature and measurement values of the various activities demanded 
of the student in the two types of examinations. In order to make our 
comparison as valid as possible, we shall include in our analysis: (a) one 
new-type French examination not used in the Regents experiment; (b) the 
1924 as well as the 1925 old-type Regents examinations in French, and 
(c) the 1924 and 1925 College Entrance Examination Board French 
examinations. 

Methods of construction. — The old-type modern language examina- 
tions used in this experiment were constructed and edited in accordance 
with the subjective judgments of committees composed of notable schol- 
ars and teachers. The members of the committees were appointed long 
before the examinations were actually made and some members had 
served on similar examination committees for several years, and therefore 
had considerable experience back of their judgment in such matters. 
There was one committee for each language, the State Supervisor of 
Modern Languages being an ex-officio member of all three committees. 
The examination committee for each language constructed the second-,
third- and fourth-year examinations in that language. Each member of
each committee brought to the meeting at which the final form of the
examination was decided upon, carefully prepared questions for one or 
several parts of the examination. The offerings of each member of the 
committee were discussed by the committee as a whole and the questions 
finally adopted represented the best questions brought to the meeting 
by the various members. There is no means of knowing how much time 
the members of the committees spent on their tasks before they came to 
the one and only conference of examination makers. But the conference 
at which one set of second-, third- and fourth-year examinations was
finally constructed took up the better part of a working day. The writer 
of this report was privileged to meet with one or two of these committees 
and was greatly impressed by the seriousness with which the members 
took their tasks, and with the care exercised in criticizing and in improving 
the questions that were finally adopted for each of the examinations. 
The most impressive feature of the conference, however, was that all 
questions regarding the relative difficulties of two or more words, or of 
two or more sentences or paragraphs, or regarding the relative measure- 
ment values of two similar or dissimilar questions, were decided, as far 
as could be made out, entirely by subjective judgment. There was no 
reference made at any time during the meetings to any objective evidence 
concerning the difficulty or the measurement value of any of the questions 
discussed. There were frequent disagreements on such questions, and 
after free, and often spirited, discussion or argument the matter was 
settled by a vote of the members of the committee. After the examina- 
tions had been made by these committees they were taken to Albany by 
the supervisor and there passed through several editorial processes. 
Reproductions of old-type examinations. — To illustrate the general
character of the examinations constructed in this manner, the Regents 
French and physics papers used in June, 1925, are reproduced here. The 
reader is invited in passing to notice particularly two features of these old- 
type examinations. (a) The directions state that “The minimum time
requirement (for admission to the examination) is five recitations a week for”
a certain number of years (see second paragraph of heading of each ex- 
amination below, pp. 200, 202, 204 and 206). This is the same concrete 
manifestation of the old time-serving conception of education as is found 
on the College Entrance Examination Board examinations. It is a curious 
survival in an age which boasts the discovery and constructive exploitation 
of individual differences. (b) Modern language students are advised that 
“credit for oral work,” as determined by local standards, may be substituted
for question 7, which calls for an original composition in the
foreign language (see last paragraph of Note to the Student, pp. 200, 202 and
204). These two features will be discussed in later sections of this report
(pp. 306 and 317).


FRENCH — Two Years
Thursday, June 18, 1925 — 1.15 to 2.45 p.m., only 


Write at top of first page of answer paper (a) name of school where you have studied, 
(b) number of weeks and recitations a week in French 1 and 2. 

The minimum time requirement is five recitations a week for a school year in 
(a) first year French, (b) second year French. 


NOTE TO THE STUDENT


This part of the examination is limited in time to an hour and a half. Do not waste 
time. Do not make a preliminary draft to be copied later. Do first whatever question 
you can do most readily, but number your answers to correspond to the questions. 

Students who are entitled to credit for oral work may omit question 7. Students who 
are not entitled to credit for oral work should answer all the questions. Credit for oral 
work may be added on the answer paper directly to the mark obtained in the written exami- 
nation. 


1 Traduire en anglais: (Two credits to a line) [25] 


UN JOUR DE VACANCES 


C’était un jour de vacances: grande fête pour les enfants qui n’aimaient
rien tant que d’aller à l’école, si ce n’est (unless it be) de rester à la maison
et d’avoir devant eux toute une journée pour faire ce qui leur passerait
par la tête. La mère était joyeuse aussi de garder ses enfants autour
d’elle, mais pourtant un peu inquiète et effrayée d’avoir à gouverner
pendant douze heures, sans aucun secours (help), tout ce petit peuple tapa-
geur (noisy), toujours entre le rire et les larmes, entre une joie bruyante
(noisy) et des désespoirs plus bruyants encore. Elle s’était endormie la
veille au bruit de la pluie qui tombait, tombait, une de ces pluies tran-
quilles et abondantes qui semblent avoir une ferme résolution de ne
jamais s’arrêter. En se réveillant, elle l’entendit encore. C’était un peu
décourageant. 


2 Traduire en français: [20]


When we awoke this morning at half past seven it was raining. We 
couldn’t play in the yard (garden) because of the rain. After breakfast 
mother wished us to sit in the living room and read but George preferred 
to play cards. We had just sat down when my uncle came in and we had 
to listen to his stories. In the afternoon the rain stopped and we went 
for a walk. So we had a good time after all. 


3 Ecrire les temps primitifs et la première personne singulier du futur
de trois des verbes suivants: lire, écrire, croire, sortir, voir, prendre. [15] 


4 Compléter cinq des phrases suivantes: [5]
a Il m’aide . . . faire mes devoirs.
b Apprenez-vous . . . parler français?
c Ils n’ont plus . . . argent.
d . . . quoi pensez-vous?
e Nous essayons . . . trouver ces papiers.
f J’ai . . . papier jaune.
g C’est une femme . . . je connais le fils.


5 Mettre la forme convenable du verbe entre parenthèses: [5]
a Nous désirons qu’il (venir).
b C’est moi qui les (avoir) écrits.
c Elle s’est (casser) le doigt.
d Nous ne nous sommes pas (voir).
e Je ne crois pas qu’elle (devoir) faire cela.


6 Traduire en français cinq des expressions suivantes: [10]


(a) You ought to do it, (b) four months ago, (c) while reading his lesson,
(d) He must work, (e) It is late, (f) He is late, (g) They should have
looked for it.


7 Ecrire en français une composition de 75 mots environ sur un des
sujets suivants: [20]

(a) La France, (b) Les Français, (c) La langue française (le français),
(d) L’école (le bâtiment, les matières, les professeurs).


The University of the State of New York 
FRENCH — Three Years
Thursday, June 18, 1925 — 1.15 to 2.45 p.m., only 


Write at top of first page of answer paper (a) name of school where you have studied, 
(b) number of weeks and recitations a week in French 1, 2, and 3. 

The minimum time requirement is five recitations a week for a school year in 
(a) French 1, (b) French 2, (c) French 3. 


NOTE TO THE STUDENT


This part of the examination is limited in time to an hour and a half. Do not waste 
time. Do not make a preliminary draft to be copied later. Do first whatever question you 
can do most readily, but number your answers to correspond to the questions. 

Students who are entitled to credit for oral work may omit questions 6 and 7. Students 
who are not entitled to credit for oral work should answer all the questions. Credit for 
oral work may be added on the answer paper directly to the mark obtained in the written 
examination. 


1 Traduire en anglais: (Two credits to a line) [25] 


LE NOUVEAU PENSIONNAIRE


La femme et son fils étaient venus tous les deux en voiture. Elle était
veuve, et fort riche, à ce qu’elle nous fit comprendre. Elle avait perdu
le cadet de ses deux enfants, qui était mort un soir au retour de l’école,
pour s’être baigné avec son frère dans un étang malsain. Elle avait
décidé de mettre l’aîné, Augustin, en pension chez nous, pour qu’il pût
suivre le Cours Supérieur. Et aussitôt elle fit l’éloge de ce pensionnaire
qu’elle nous amenait. Ce qu’elle contait de son fils avec admiration était
fort surprenant: il aimait à lui faire plaisir, et parfois il suivait le bord de
la rivière, jambes nues, pendant des kilomètres, pour lui rapporter des
œufs de canards sauvages. L’autre nuit il avait découvert dans le bois
une faisane prise au collet dans des nasses qu’il avait tendues.


2 Traduire en français: [20]


a Do you intend to write to your mother today? 
b If he has enough money he will give us some. 
c Did you have a good time in the country last week? 
d He succeeded at last in buying the house. 
e I wish that you would speak more distinctly. 
f After leaving the house I lost my tickets. 
g She went home at three o’clock. 
h How long have you been studying French?
i I am glad I live so near the school.
j It is very warm now, every day, even when it rains.


3 Ecrire les temps primitifs des verbes suivants: voir, lever, devenir,
sentir, pleuvoir. [10]


4 Traduire en français: (a) Give it to him, (b) Do not tell it to her,
(c) What are you thinking of? (d) What do you think of him? (e) What
fell? [10]


5 Employer cinq des expressions suivantes dans des phrases courtes et
traduire les phrases en anglais: à la bonne heure, de bonne heure, à l’heure,
à temps, tout de suite, tout à coup, au devant de, à la rencontre de. [10]


6 Compléter les phrases suivantes: [5] 


(a) Ma mère désire que ..., (b) Je le ferais, si vous..., (c) Il faut que
l’élève..., (d) Il est vrai que..., (e) Qui que vous..., voici votre
maître.


7 Ecrire en français une composition de cent mots environ sur un des
sujets suivants: (a) Un livre français, (b) Un voyage, (c) Un Français
célèbre. [20]


The University of the State of New York 
FRENCH — Four Years
Thursday, June 18, 1925 — 1.15 to 2.45 p.m., only 


Write at top of first page of answer paper (a) name of school where you have 
studied, (b) number of weeks and recitations a week in French 1, 2, 3 and 4. 

The minimum time requirement is five recitations a week for a school year in 
(a) French 1, (b) French 2, (c) French 3, (d) French 4.


NOTE TO THE STUDENT


This part of the examination is limited in time to an hour and a half. Do not waste 
time. Do not make a preliminary draft to be copied later. Do first whatever question you 
can do most readily, but number your answers to correspond to the questions. 

Students who are entitled to credit for oral work may omit question 4. Students who are 
not entitled to credit for oral work should answer all the questions. Credit for oral work 
may be added on the answer paper directly to the mark obtained in the written examination. 


1 Traduire en anglais: (Two credits to a line) [25] 


JEANNE D’ARC 


Une enfant de douze ans, une toute jeune fille, confondant la voix de
son cœur avec la voix du ciel, conçoit l’idée étrange, improbable, absurde,
si l’on veut, d’exécuter la chose que les hommes ne peuvent plus faire,
de sauver son pays. Elle couve cette idée pendant six ans sans la confier
à personne; elle n’en dit rien, même à sa mère, rien à nul confesseur.
Sans nul appui de prêtre ou de parents, elle marche tout ce temps avec
Dieu dans la solitude de son grand dessein. Elle attend qu’elle ait dix-
huit ans, et alors immuable elle l’exécute malgré les siens et malgré tout
le monde. Elle traverse la France ravagée et déserte, les routes infestées
de brigands, elle s’impose à la cour de Charles VII, se jette dans la guerre
et dans les camps qu’elle n’a jamais vus.


2 Traduire en français: [25]


Two men were neighbors and each of them had a wife and several little 
children. One of the men was uneasy, saying, “If I die or if I fall ill, 
what will become of my wife and children?” This thought never left 
him and it gnawed at his heart as a worm gnaws the fruit in which it is 
hidden. But the other man lived in peace. “Because,” said he, “God,
who knows all his creatures and watches over them, will also protect me
and my wife and my children.”


3 Ecrire en français une composition de 100 mots environ sur un livre
que vous avez lu ou un voyage que vous avez fait. [20]


4 Répondre à toutes les demandes suivantes:

a Indicate the main difficulty or peculiarity in the sound of one 
consonant in each of the following words or groups of words: 
absurde, soixante, second, calme, Bruxelles, nom anglais, cent 
un, vaciller, sang impur, grand homme. [10] 

b Write five French sentences, using in each a different compound 
conjunction followed by the subjunctive. [10] 

c Translate: by going; without going; instead of going; after return- 
ing; I have finished reading. [10] 


The University of the State of New York 
233D HIGH SCHOOL EXAMINATION
PHYSICS 
Thursday, June 18, 1925 — 9.15 to 10.45 a.m., only 


Write at top of first page of answer paper (a) name of school where you have 
studied, (b) number of weeks and recitations a week in physics, with the total number 
of laboratory periods and the length of such periods. 

The minimum time requirement is five recitations a week for a school year. A
double laboratory period counts in place of one recitation. Laboratory work
equivalent to 30 double periods is required.

This part of the examination is limited in time to an hour and a half. 


Answer five questions, selecting at least one from each group. 


Group I 
Answer at least one question from this group. 

1 Using a single sentence in each case, state (a) the effect of change of 
temperature on the speed of sound [2], (b) the cause of differences in the 
intensity of sound [2], (c) how the particles of air move when a sound 
wave is coming toward one [2], (d) how an echo is produced [2], (e) how the 
length of an organ pipe affects the pitch of the sound produced [2]. 

2 Name the chief method of heat transference in each of the following 
cases: (a) from the sun to the earth [2], (b) through the iron of a stove or
of a radiator [2], (c) in heating a room with a hot air furnace [2]. In one 
of the above cases describe the process by which the heat is transmitted [4]. 

3 Describe an experiment to show how the principal focus of a convex 
lens is found [5]. If the focal length of the lens is found to be 16 cm, how 
far from the lens is the image of an object placed 32 cm from the lens [5]? 


Group II 
Answer at least one question from this group. 
4 On a 110 volt circuit, find (a) the cost of operating for two hours, at
8 cents a kilowatt hour, a projection lantern equipped with a 1000 watt
lamp [4], (b) the current flowing through the lamp [3], (c) the resistance
of the lamp [3].


5 By means of a labeled diagram show the path of the current through 
a continuous ringing electric bell [5]. Add to the diagram so as to show 
a push button and two dry cells connected as in a house [3]. Are the 
parts of this circuit connected in series or in parallel [2]? 


Group III
Answer at least one question from this group. 


6 State the principle of Archimedes [3]. Explain why it does or does
not apply to a floating body [4]. State just why the following statement
is incorrect: “Ice formed in sea water is less dense than that formed in
fresh water, since sea water ice floats higher than fresh water ice” [3].

7 A loaded sleigh weighing 3000 pounds is just prevented from sliding
down a hill that rises 8 feet in 100 feet of road. What is the force exerted
by the team of horses [5]? How much work against gravity would the 
team do in drawing the sleigh 200 feet up the hill at uniform speed [5]? 

8 State the law of conservation of energy [3]. Give an example of a 
body possessing kinetic energy and one of a body possessing potential 
energy [2]. State the law of motion illustrated when a gun kicks [2]. 
Explain its application to this case [3]. 


Method of constructing new-type tests. — The construction of the new- 
type modern language examinations used in this experiment was a much 
longer process. The forms of the questions were decided upon after three 
years of extensive experimentation with more than 20 different forms of 
modern language examination questions. The vocabulary samplings 
used were based upon word counts of sixteen widely used textbooks in two 
languages and upon a synthesis of various word counts of textbooks and 
syllabi in the third language. The grammar and idiom content of the new- 
type tests was determined item by item, on the basis of extensive syntheses 
of standard grammars and syllabi. These selections of materials were 
finally checked up by actual experiment. No item was accepted for final 
use unless it gave a positive correlation with known criteria of achievement. 
Each new-type examination was constructed to cover the whole range of 
achievement in high school modern language work. That is, each exami- 
nation included some items easy enough to be answered correctly by the 
poorest students at the end of the first year and some questions so hard 
that they could be answered only by the best students at the end of the 
fourth year. The “easiness” or “hardness” of each individual item was 
determined upon the basis of the responses by numbers of modern language 
students in high schools ranging from several hundred to several thousand. 
No appeal was made at any time to the subjective judgment of scholars 
regarding the difficulty or the measurement value of an item. 
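
The two objective checks described in this paragraph, difficulty taken from observed responses and acceptance only on a positive correlation with a criterion, may be illustrated by a short sketch. The function names, the 0/1 scoring, the sample data, and the choice of a Pearson correlation are assumptions made for illustration; the report does not state the formula actually used.

```python
from statistics import mean, pstdev


def item_difficulty(responses):
    """Share of students answering the item correctly (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def item_criterion_correlation(responses, criterion):
    """Pearson correlation between 0/1 item scores and criterion scores."""
    mx, my = mean(responses), mean(criterion)
    sx, sy = pstdev(responses), pstdev(criterion)
    if sx == 0 or sy == 0:
        return 0.0  # no variation, so the item cannot discriminate
    cov = mean([(x - mx) * (y - my) for x, y in zip(responses, criterion)])
    return cov / (sx * sy)


def accept_item(responses, criterion):
    """Keep an item only if it correlates positively with the criterion."""
    return item_criterion_correlation(responses, criterion) > 0


# Hypothetical data: six students, one item, and an outside criterion score.
responses = [1, 1, 0, 1, 0, 0]
criterion = [90, 80, 40, 70, 50, 30]
print(item_difficulty(responses))         # share answering correctly
print(accept_item(responses, criterion))  # True when the correlation is positive
```

An item that the weaker students answer correctly more often than the stronger ones yields a negative correlation and would be rejected under this rule.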

The most striking difference between the two methods of construction, 
aside from the difference in quality of the resulting products, is that the 
new-type method shows a much more adequate appreciation than the old- 
type method of both the difficulty and the importance of constructing 
sound examinations. In the old-type very much is left to subjective
opinion and to chance; in the new-type everything that is capable of 
objective verification is experimentally verified. It is because of the 
reliance on experimental evidence that the new-type examinations are 
constructed so as to cover the whole range of achievement in a single 
examination, thus making possible comparability between the achieve- 
ments of the various classes; and that such comprehensive examinations 
are at the outset made in several equivalent forms, thus making possible 
comparability of measurements at different stages of the student’s progress. 

Advantages of one examination for all year-classes. — The advantages
of one examination covering the whole range of achievement in high school
modern language work have been demonstrated in the comparisons made in
previous sections of this report between the achievements of second-,
third- and fourth-year classes. It is obvious that such comparisons
could not be made on the basis of the old-type examinations. The value 
of using forms of examinations which are known to be equivalent in diffi- 
culty and in representative character of modern language materials, may 
be illustrated by comparisons which we are enabled to make between the 
vocabulary achievements of students of Spanish in the junior high schools 
of New York City and in the senior high schools in New York State. 
The 100-word Spanish vocabulary test used in the new-type Regents 
examination happened to be exactly equivalent to that in the new-type 
test used in the junior high schools of New York City in June, 1925. 
The median scores of the 9B and RD junior high school classes on the 
Spanish vocabulary test were 57 and 58, respectively.¹ The state-wide
medians of Spanish II and III students on the vocabulary Part of the 
new-type Regents examination are 70 and 81, respectively. It is quite 
reasonable to infer that Spanish I senior high school students would 
achieve an average score at the end of the course, i.e., at the end of one 
year of study, about 12 points below the average of Spanish II students,
i.e., about 57 or 58. Thus we have objective evidence in support of the 
theory that the junior high school students on the average achieve about 
as much in four or three semesters as senior high school students (who 
survive to the end of the second year) achieve in two semesters. The 
overlapping of the junior high school fourth- and third-semester students 
and of the second-year senior high school students is about the same 
as the overlapping of the latter and the third-year senior high school 
students. About 13% of the 9B and RD Spanish students in the junior 
high schools surpass the average of the Spanish II senior high school students 
with respect to vocabulary. About one-sixth of the “rapid advancement”
students in the junior high school Spanish classes achieve as much or
more in three semesters with respect to vocabulary as the average Spanish
II students after four semesters in senior high school Spanish work.¹

¹ The 9B class is the fourth-semester class of “normal” students and the RD class is the third and final
class of “rapid advancement” students.
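
The inference about first-year senior high school students rests on simple arithmetic, set out here as a small sketch. The medians are those quoted in the text; the variable names are illustrative, and the 12-point gap is the report's own round figure rather than a measured value.

```python
# Medians reported in the text for the 100-word Spanish vocabulary test.
median_spanish_2 = 70   # state-wide, end of second year
median_spanish_3 = 81   # state-wide, end of third year
junior_high_medians = {"9B": 57, "RD": 58}

observed_gain = median_spanish_3 - median_spanish_2  # second to third year
assumed_first_year_gap = 12                          # the report's "about 12 points"
estimated_spanish_1 = median_spanish_2 - assumed_first_year_gap

# Distance of each junior high school median from the estimated first-year median.
for name, median in junior_high_medians.items():
    print(name, median - estimated_spanish_1)
```

The estimate assumes that the gain from the first to the second year is about as large as the observed gain from the second to the third.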

Reproductions of new-type Regents examinations of June, 1925. — 
Except for the columns of figures at the left, and except for minor differ- 
ences in size of type and in spacing, the new-type parts of the Regents 
examinations of June, 1925, are reproduced below exactly as they were 
presented to the high school students. Correction of some obvious defects
has been omitted so that the reader may secure as accurate a picture
of the experiment as possible. The first column of figures at the left
shows for each of the questions the per cent of random samplings of stu- 
dents that answered it correctly. Column 2 shows the order of difficulty, 
from easiest to hardest, of the items within each Part of the examinations. 
Column 3 shows the order of difficulty of all the items in each examina- 
tion. Thus the first question in Part I of the French examination is the 
fifth easiest of the 100 items in Part I, and the 10th easiest of the 275 
questions in the whole French test. The fourth column shows for each 
item in each test the difference between the per cent of fourth-year and 
the per cent of second-year students that answered it correctly. In the 
case of the Physics test the difference is between the per cent of highest-quarter
students and the per cent of lowest-quarter students that answered
correctly. A minus sign before a figure in Column 4 indicates
that a larger proportion of lower students answered the item correctly
than of more advanced or better students.

The fifth column shows the order of “goodness” or of validity of all 
the items in each test. 
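
How figures of the kind shown in these columns might be derived from raw percentages is sketched below. The function names and the sample percentages are hypothetical. Ranking "goodness" by the size of the Column 4 difference is an assumption made for illustration only, since the report explains its own procedure in a later section.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; ties keep their original order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def item_columns(pct_all, pct_second, pct_fourth):
    """Build per-item figures from per cents answering correctly."""
    difference = [f - s for f, s in zip(pct_fourth, pct_second)]
    return {
        "pct_correct": pct_all,                        # Column 1
        "difficulty_rank": rank_descending(pct_all),   # 1 = easiest item
        "year_difference": difference,                 # Column 4; may be negative
        "validity_rank": rank_descending(difference),  # assumed basis for Column 5
    }


# Hypothetical per cents for three items.
columns = item_columns(
    pct_all=[90, 40, 65],
    pct_second=[85, 20, 70],
    pct_fourth=[95, 60, 60],
)
print(columns)
```

The third item in this example shows a negative difference, the case marked by a minus sign in Column 4.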

The sixth and the seventh columns, appearing only in Part I of the 
French test, show the Henmon and the Wood frequencies of each French 
word in the test. The Henmon frequency means, e.g., that the first 
French word in Part I, nom, occurred 159 times in 400,000 words of run- 
ning discourse; the Wood frequency means that the word nom was com- 
mon to sixteen widely used French textbooks. 

These columns of figures will be explained later in this report; for the 
moment the reader may confine his attention to the form, content and 
structure of the examinations. For convenience of reference the pages for 
each Part of each test are as follows: 

French, Part I 210-216; Part II 217-221; Part III 222-229. 

Spanish, Part I 230-235; Part II 236-240; Part III 241-249.

German, Part I 250-255; Part II 256-259; Part III 260-269. 

Physics, (144 true-false statements) 270-282. 

The reader will find it convenient to consult these references when the 
discussion of the columns of figures at the left of the questions is resumed 
below on page 296, and particularly on page 302.

¹ Cf. Junior High School Survey, above, pp. 9 to 38.

Size of vocabulary samplings. — A fundamental factor in the validity
of any examination is the size of the sampling of materials and tasks which 
it involves. A simple illustration of this principle may be taken from 
the field of spelling. Every one knows that a spelling test of only ten 
words will not give an adequate measure of an individual’s English spelling 
ability. A 50-word spelling test might give a fairly reliable indication 
of the average spelling ability of a whole class, but there are few scholars 
who would base their judgment of an individual’s spelling ability on the 
results of fifty words. Other things being equal, the larger the number 
of words in a spelling test, the more accurate and reliable is the resulting 
indication of an individual’s spelling ability. The phrase “other things 
being equal” here covers a multitude of sins because the character of the 
sampling is quite as important as the size; but for the moment let us keep 
in mind only the matter of size of sampling. If the new-type form of
examination permits us to use a larger sampling of modern language 
materials than the old-type, other things remaining equal, the new-type 
is a more valid form of examination. 
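
The effect of the size of sampling may be made concrete with a small calculation. The sketch below assumes that the words of a spelling test are an independent random sample and that the share spelled correctly follows a simple binomial model; neither assumption is stated in the report, and the 70 per cent speller is hypothetical.

```python
import math


def standard_error(p, n_words):
    """Standard error of an observed share correct on an n-word test."""
    return math.sqrt(p * (1 - p) / n_words)


# Hypothetical speller who knows 70% of the words in the population sampled.
for n_words in (10, 50, 100, 400):
    print(n_words, round(standard_error(0.7, n_words), 3))
```

Under these assumptions the uncertainty falls with the square root of the number of words, so that a test four times as long halves the error of the individual measure.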

Vocabulary sampling of ninety-minute new-type test about twice as large 
as that of three-hour old-type examination. — A vocabulary analysis of the 
old-type Regents and College Entrance Examination Board French exam- 
inations of June, 1924, and June, 1925, shows that the numbers of differ- 
ent French words involved in old-type three-hour examinations, exclusive 
of the English-to-French translation and of the free composition questions, 
range from 206 to 270 root words. The new-type 90-minute Regents 
examination in French of June, 1925, includes 459 different French words, 
and the new-type 90-minute junior high school French examination of 
June, 1925, involves 529 different root words. Inflected forms of the 
same root words were not counted as additional words; that is, one verb 
might appear in two or three tenses and in two or three numbers, but 
these were all counted as one root word. 
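
The counting rule just described, by which inflected forms are credited to a single root word, can be sketched as follows. The mapping from inflected form to root word and the sample tokens are hypothetical; the report does not describe the procedure by which forms were assigned to roots.

```python
def count_root_words(tokens, root_of):
    """Count distinct root words, treating inflected forms as one word."""
    roots = {root_of.get(t.lower(), t.lower()) for t in tokens}
    return len(roots)


# Hypothetical mapping from inflected form to root word.
root_of = {"était": "être", "est": "être", "sont": "être", "enfants": "enfant"}
tokens = ["était", "est", "sont", "enfants", "enfant", "la"]
print(count_root_words(tokens, root_of))
```

Forms absent from the mapping are counted under their own spelling, so the result is only as good as the mapping supplied.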

That the new-type 90-minute examinations permit a sampling of root 
words roughly twice as large as the sampling permitted by the old-type 
three-hour examinations is a fact of tremendous significance. This fact 
alone is sufficient to account for many of the advantages of the new-type, 
in terms of reliability and validity, pointed out in earlier sections of this 
report. It should be noted, however, that this comparison relates only 
to size of vocabulary sampling and, moreover, that it does not take into 
account two important parts of the old-type examinations, namely, the 
English-to-French translation question and the free composition question. 
These two questions probably would add between 50 and 75 different 
root words to the numbers reported above for old-type three-hour exami- 
nations. But since these two questions are generally allowed about 
a third of the total credit of an old-type three-hour examination and 
therefore presumably take up about one of the three hours of time, we
may leave the numbers as reported above and bear in mind that we are 
comparing a two-hour old-type with a 90-minute new-type examination. 

That this comparison is limited to size of vocabulary only should not 
lead us to underestimate its importance. It may be objected, for ex- 
ample, that although the old-type includes a much smaller sampling of 
different root words, it makes up for this deficiency by the character of 
the words used and by the richness of the grammatical and the idiomatic 
contexts in which these words appear. The answer to this objection is 
that the display of grammatical and idiomatic knowledge by a student 
depends quite as much upon the specific vocabulary presented to him as 
upon his knowledge of the principles of grammar involved. Many idioms,
for example, are inseparable from certain specific words, so that there is
little room for doubt that any limitation of vocabulary sampling is necessarily
a measurable limitation of grammar and idiom sampling.

Character of vocabulary sampling. — The objection mentioned in the 
preceding paragraph, that the character of the vocabulary sampling in 
the old-type might be such as to make up for its small size, would be 
very serious indeed if supported by evidence. It is quite obvious that 
merely increasing the number of very easy words which all students know, 
or of very difficult words, which very few students know, does not in- 
crease the effective sampling or the real validity of an examination. The 
existence of Professor Henmon’s French Word Book¹ and of the Wood
French word list² makes possible an objective study of the character of
the vocabulary samplings in the old- and new-type French examinations 
with which we are now concerned. Table 37 shows ordinary and cumu- 
lative distributions, according to frequency of occurrence, of the 3900 
words in the Henmon French Word Book and of the 2683 words in the 
Wood French word list. 

Henmon and Wood French Word Frequency Distributions. — According 
to the Henmon French Word Book there are 153 different root words 
which occur between 200 and 27,000 times in 400,000 words of running 
discourse; and according to the Wood French word list there are 134 root 
words which are common to 16 widely used French textbooks. These 
words are the essential words of the language which all students neces- 
sarily learn in the first year of modern language work, if they learn any- 
thing at all about the modern language. As far as vocabulary sampling 
in a test is concerned, the inclusion of a large number of these easy words 
would be wasteful and would not add to the effectiveness of the sampling. 
Certainly not more than 5 or 10% of the different words in a test should
come from this group. For third- and fourth-year students, words from
this group probably have no measurement value as far as vocabulary alone
is concerned. Similarly, words which occur fewer than five times in
400,000 words of running discourse, or words which are common to fewer
than four of sixteen widely used textbooks, almost certainly have very
little measurement value for first- and second-year students. We should
therefore expect the second-year old-type examinations to contain larger
proportions of the very easy words than the third- and fourth-year old-type
examinations, and the latter to contain larger proportions of the more
difficult words.

¹ Henmon, V. A. C., A French Word Book, University of Wisconsin Bulletin No. 3, Sept. 1924. This
study has been extended by Ward (C. F.) in his book entitled Minimum French Vocabulary Test Book.

² Wood, Ben D., A Comparative Study of the Vocabularies of Sixteen French Textbooks, The Modern
Language Journal, February, 1927.


TABLE 37


The Henmon and Wood French Word Lists


DISTRIBUTIONS OF 3900 ROOT WORDS OCCURRING FIVE TIMES OR OFTENER IN 400,000
WORDS OF RUNNING DISCOURSE ACCORDING TO FREQUENCY OF OCCURRENCE, AND OF
2683 WORDS COMMON TO FOUR OR MORE OF SIXTEEN TEXTBOOKS ACCORDING TO NUMBER
OF BOOKS TO WHICH THEY ARE COMMON. THE TABLE IS TO BE READ AS FOLLOWS:
FIRST LINE — 59 ROOT WORDS OCCURRED BETWEEN 500 AND 27,000 TIMES IN 400,000
WORDS OF RUNNING DISCOURSE; AND 134 WORDS WERE FOUND COMMON TO SIXTEEN
WIDELY USED GRAMMARS AND COMPOSITION BOOKS. SECOND LINE — 94 WORDS
OCCURRED 200-499 TIMES, AND 153 WORDS OCCURRED 200 TIMES OR OFTENER IN 400,000
WORDS OF RUNNING DISCOURSE; 110 WORDS WERE COMMON TO FIFTEEN, AND 244 TO
FIFTEEN OR SIXTEEN TEXTBOOKS. SEVENTH LINE — 1035 WORDS OCCURRED 10-19
TIMES, AND 2533 WORDS OCCURRED 10 TIMES OR OFTENER IN 400,000, ETC.; 389 WORDS
WERE COMMON TO FIVE TEXTS, AND 2208 WERE FOUND COMMON TO FIVE OR MORE OF
SIXTEEN TEXTBOOKS.


THE CORRELATION BETWEEN HENMON AND WOOD FREQUENCIES IS 0.70.


       HENMON FRENCH WORD BOOK                     WOOD FRENCH WORD LIST

       Henmon Frequencies    f_H    Cum. f_H       Wood Frequencies    f_W    Cum. f_W
(1)    500-27,000             59        59         16                  134       134
(2)    200-499                94       153         15                  110       244
(3)    100-199               169       322         14                  142       386
(4)    60-99                 210       532         13-12               240       626
(5)    40-59                 280       812         11-9                463      1089
(6)    20-39                 686      1498         8-6                 730      1819
(7)    10-19                1035      2533         5                   389      2208
(8)    5-9                  1367      3900         4                   475      2683

                                   r_HW = 0.70
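
The cumulative columns of Table 37 follow from the ordinary frequencies by running addition, as the short sketch below shows. The frequencies are those printed in the table; the variable names are illustrative only.

```python
from itertools import accumulate

# Ordinary distributions from Table 37, most frequent band first.
henmon_bands = ["500-27,000", "200-499", "100-199", "60-99",
                "40-59", "20-39", "10-19", "5-9"]
f_h = [59, 94, 169, 210, 280, 686, 1035, 1367]

wood_bands = ["16", "15", "14", "13-12", "11-9", "8-6", "5", "4"]
f_w = [134, 110, 142, 240, 463, 730, 389, 475]

# Cumulative distributions: words at or above each frequency band.
cum_h = list(accumulate(f_h))
cum_w = list(accumulate(f_w))

for band, f, cum in zip(henmon_bands, f_h, cum_h):
    print(f"Henmon {band:>10}  {f:5d}  {cum:5d}")
for band, f, cum in zip(wood_bands, f_w, cum_w):
    print(f"Wood   {band:>10}  {f:5d}  {cum:5d}")
```

The final cumulative figures reproduce the totals of 3900 and 2683 words named in the caption, and the second line gives the 153 and 244 words cited in the text.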
Table 38 shows the per cents of words from each of the Henmon and
Wood frequency groups in the old-type French examinations, and Table 
39 shows similar per cents for the new-type Regents and junior high 
school French examinations of June, 1925. 

New-type superior to old-type vocabulary sampling according to Henmon 
and Wood French word lists. — Table 38 shows that the character of the 
vocabulary samplings in the old-type examinations is in every way in- 
ferior to that of the new-type samplings, if the Henmon and Wood French 
word lists have any validity at all. There is practically no difference 
between the difficulties of the vocabulary samplings of the second-, third- 
and fourth-year old-type examinations. There is practically as large a 
proportion of very easy words in the fourth-year examination as in the 


TABLE 38


DISTRIBUTIONS ACCORDING TO HENMON AND WOOD FREQUENCIES OF ROOT WORDS
IN OLD-TYPE REGENTS AND COLLEGE ENTRANCE EXAMINATION BOARD EXAMINATIONS
IN SECOND-, THIRD- AND FOURTH-YEAR FRENCH OF 1924 AND 1925. THESE
ARE ALL THREE-HOUR EXAMINATIONS, EXCEPT THE 1925 REGENTS, WHICH ARE 90-MINUTE
EXAMINATIONS. THE FREE COMPOSITION QUESTIONS IN ALL THESE EXAMINATIONS ARE,
OF COURSE, NOT ACCOUNTED FOR IN THIS TABLE; THE ENGLISH-TO-FRENCH TRANSLATION
QUESTIONS ARE ACCOUNTED FOR ONLY IN THE CASE OF THE REGENTS 1925 EXAMINATIONS.
(REGENTS 1924 FOURTH-YEAR EXAMINATION WAS NOT AVAILABLE FOR THIS
STUDY.)


Regents Old-Type Examinations


               1924 3-HOUR EXAMINATIONS             1925 90-MINUTE EXAMINATIONS
HENMON         2nd Year    3rd Year    4th Year     2nd Year    3rd Year    4th Year
FREQUENCIES     N     %     N     %     N     %      N     %     N     %     N     %
500+            52  19.2    46  22.2                 40  23.5    44  26.5    42  27.4
100-499         81  29.9    55  26.4                 56  33.0    48  28.8    40  26.1
40-99           52  19.1    31  14.8                 26  15.3    25  15.0    28  18.4
20-39           31  11.5    22  10.6                 19  11.2    17  10.3    17  11.1
5-19            30  11.1    30  14.5                 22  12.9    20  12.1    14   9.2
0-4             25   9.2    24  11.5                  7   4.1    12   7.3    12   7.8
Totals         271         208                      170         166         153

WOOD
FREQUENCIES
16              68  25.1    52  25.0                 54  31.8    52  31.3    51  33.3
15              40  14.8    28  13.5                 30  17.6    24  14.6    22  14.4
14              26   9.6    17   8.2                 19  11.2    19  11.5    18  11.8
13-12           43  15.8    23  11.1                 20  11.8    18  10.8    15   9.8
11-9            29  10.7    18   8.6                 16   9.4    13   7.8    13   8.5
8-6             19   7.0    17   8.1                 13   7.6    14   8.4     7   4.6
5-4             12   4.4    17   8.2                  6   3.6     4   2.4     7   4.6
3-0             34  12.6    36  17.3                 12   7.1    22  13.3    20  13.1
Totals         271         208                      170         166         153

College Entrance Examination Board Old-Type Examinations

                  1924 3-HOUR EXAMINATIONS            1925 3-HOUR EXAMINATIONS
HENMON         2nd Year   3rd Year   4th Year      2nd Year   3rd Year   4th Year
FREQUENCIES     N    %     N    %     N    %        N    %     N    %     N    %
500+           49  23.8   50  21.3   47  19.5      52  21.9   48  18.9   46  17.0
100-499        51  24.7   57  24.3   68  28.2     [?]  [?]    72  28.6   80  29.6
40-99          37  [?]    45  19.1   47  19.5     [?]  [?]    51  [?]    36  13.4
20-39          30  14.6   30  12.8   23   9.5      20   8.4   24   9.4   38  14.1
5-19           20   9.7   32  13.6   36  15.0      23   9.7   29  11.5   41  15.2
0-4            19   9.3   21   8.9   20   8.3      16   6.8   29  11.5   29  10.7
Totals        206        235        241           237        253        270

WOOD
FREQUENCIES
16             62  30.1   61  26.0   55  22.8      72  [?]   [?]  25.3   58  21.5
15            [?]  [?]    33  14.0   32  [?]       31  13.0   44  17.4   37  13.7
14             21  10.2   20   8.5   20   8.3      32  13.5  [?]  [?]    23   8.5
13-12          24  11.6   30  [?]    30  12.5      28  11.8  [?]  [?]    31  11.5
11-9           18   8.8   23   9.9   33  13.7      26  11.1   19  [?]    31  11.5
8-6            21  10.2   20   8.5   20   8.3      15   6.4   24   9.5   23   8.5
5-4             9   4.3   14   6.0   18  [?]        9   3.8   10   4.0   23   8.5
3-0           [?]  [?]    34  14.4   33  [?]       24  10.1   36  14.2   44  16.3
Totals        206        235        241           237        253        270

TABLE 39


DISTRIBUTIONS ACCORDING TO HENMON AND WOOD FREQUENCIES OF ROOT WORDS
IN THE NEW-TYPE JUNIOR HIGH SCHOOL AND REGENTS EXAMINATIONS IN FRENCH
OF JUNE, 1925. THESE TWO EXAMINATIONS ARE IDENTICAL IN GENERAL FORM, BUT
PART II, READING-COMPREHENSION, IS MADE UP OF MULTIPLE CHOICE QUESTIONS IN
THE JUNIOR HIGH SCHOOL AND OF TRUE-FALSE QUESTIONS IN THE REGENTS EXAMINATION.

New-Type Examinations in French of June, 1925

                  JUNIOR HIGH SCHOOL: 90 MINUTES                 REGENTS: 90 MINUTES
HENMON        Pt. I   Part II    Part III   Whole Test     Pt. I   Part II    Part III   Whole Test
FREQUENCIES     N     N    %     N    %     N    %           N     N    %     N    %     N    %
500+            3    53  12.0   28  37.5   54  10.2          1    51  14.9   30  44.1   55  12.0
100-499       [?]   [?]  [?]   [?]  [?]   [?]  [?]         [?]   [?]  [?]   [?]  [?]   [?]  [?]
40-99          19   109  24.6   12  15.9  124  23.4          9    66  19.1    7  10.4   80  17.4
20-39          17    62  14.1    9  12.0   79  14.9         17    68  18.3  [?]  [?]   [?]  [?]
5-19          [?]   [?]  [?]     1   1.3   95  18.0         38    47  13.7    6   8.8   86  18.7
0-4            15    46  10.4    4  [?]   [?]  [?]          20   [?]   6.1    1   1.5   40   8.7
Totals        100   442         75        529              100   344         68        459

WOOD
FREQUENCIES
16            [?]   [?]  [?]    30  40.0   98  18.5        [?]   [?]  [?]   [?]  [?]   [?]  [?]
15            [?]   [?]  [?]   [?]  [?]   [?]  [?]           8    46  13.3  [?]  [?]   [?]  [?]
14            [?]   [?]  [?]   [?]  [?]   [?]  [?]           6    45  13.1    7  10.8   53  11.6
13-12          18    82  18.5    9  12.0   94  17.7        [?]   [?]  [?]   [?]  [?]   [?]  [?]
11-9          [?]   [?]  [?]     5   6.7   74  14.0          4    53  15.3    7  10.2   60  13.1
8-6            29    43   9.7    3   4.0   79  13.1         34    22   6.4    2  [?]   [?]  [?]
5-4            21   [?]   3.6  [?]  [?]    35   6.7         39     9  [?]     1   1.5   48  10.5
3-0           [?]    33   7.5  [?]   9.3   38  [?]         [?]    36  10.5    1   1.5   37   8.1
Totals        100   442         75        529              100   344         68        459




second-year examination; and the proportions of difficult words in the 
second-year examinations are almost as great as in the fourth-year exami- 
nations. Some of the opponents of the objective tests are particularly 
suspicious of the value of the new-type vocabulary test. The statement 
is frequently made that they “do not care about mere vocabulary.” Yet
10% of the words in the old-type second-year examinations occurred 
fewer than five times in 400,000 words of running discourse and 15% of 
them are common to fewer than four of sixteen widely used textbooks; 
in other words, the makers of old-type tests go into the fifth thousand 
of French words in their second-year examinations. This is indeed “not 
caring about vocabulary” with a vengeance. 

The facts of Tables 38 and 39 are summarized graphically in Chart 44. 
The indications for these old-type examinations may be accepted as 
thoroughly representative of old-type French examinations generally, 
because they are based on analyses of eight full three-hour examinations, 
excluding English-to-French and free composition questions, and on three 
ninety-minute old-type examinations, excluding the free composition 
questions. The proportions of words in these examinations from the several 
Henmon and Wood frequency groups would probably not be observably 
changed by the inclusion of the English-to-French and free composition 
questions. The proportions for the three ninety-minute Regents exami- 
nations, in which the English-to-French questions are included in the 
vocabulary analysis, are not markedly different from those for the other 
old-type examinations. 

It is apparent from Chart 44 that all of the old-type examinations 
include almost equal proportions of words in the highest Henmon fre- 
quency groups, that is, of words that occur one hundred times or oftener 
in 400,000 words of running discourse. Each of the old-type examinations
contains 42 or more of the 59 words that occur 500 times or oftener in
400,000 words of discourse. Such words can have little if any measure- 
ment value, as vocabulary sampling, except possibly in the first- and 
second-year classes. That they represent from a fifth to a fourth of the 
total number of different root words in the fourth- and third-year old-type 
examinations may therefore be taken to mean that the effective vocabulary 
sampling of the old-type examinations for these year-classes is only four- 
fifths or three-fourths as large as the total numbers of different root words 
in them, that is, an average of less than 200 rather than an average of 
240 words. It may be argued, perhaps with considerable reason, that these 
fundamental and indispensable words are essential for the testing of other 
elements such as grammar, idioms, syntax, etc., in the higher years; but
this argument cannot even remotely be used in justification of the nearly 
equally large proportions of very difficult words (from the opposite end 
of the frequency scale) in the second-, third- and fourth-year examinations. 
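
The arithmetic of the "effective vocabulary sampling" may be made explicit. The sketch below assumes only the figures quoted in the paragraph above (an average of 240 different root words, of which a fifth to a fourth are drawn from the 59 commonest words); the function name is hypothetical.

```python
from fractions import Fraction


def effective_sample(total_root_words, easy_share):
    """Root words left after discounting the share with little measurement value."""
    return total_root_words * (1 - easy_share)


# An examination of 240 different root words, one fifth of them very easy:
fifth_case = effective_sample(240, Fraction(1, 5))
# The same examination with one fourth of them very easy:
fourth_case = effective_sample(240, Fraction(1, 4))

print(fifth_case, fourth_case)  # 192 180
```

Both results fall below 200, which is the sense of the statement that the effective sampling averages less than 200 rather than 240 words.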
[Chart 44: bar chart. Vertical axis: number of unduplicated words. Bars: College Entrance Examination Board French examinations (3 hours each); Regents French examinations; new-type Junior High School and Regents tests (1½ hours each). Legend: 500+ = first 59 words; 100-499 = first 322 words; 5-99 = first to fourth thousand; 0-4 = fifth thousand.]


CHART 44. Numbers of different root words in old- and in new-type tests dis-
tributed according to Henmon frequency groups.


Chart 44 shows that, on the whole, the second-year examinations con- 
tain as large proportions of words not in the Henmon Word Book as the 
third- and fourth-year examinations do. It is difficult to find any plausible 
excuse for the inclusion of almost as many fifth-thousand words in the 
French II examinations as in the French IV examinations. There are 
certainly not many grammatical or idiomatic forms that second-year 
students ought to know which depend on the use of words that will not 
be encountered five times in 400,000 words of general reading matter. 
It seems more likely that the inclusion of the rare words in the second- 
year examinations, and also the inclusion of the very easy words in the 




fourth-year examinations, is due to the chance factors inherent in the 
subjective and unscientific method of constructing old-type examinations. 

Not infrequently one hears modern language teachers say that in their 
opinion a certain old-type examination is easier than the one designed 
for the next lower class, or harder than the one for the next higher class. 
And it is a matter of record that considerable numbers of students some- 
times pass a third- or fourth-year examination on the same day that 
they fail a second- or third-year examination.¹ Chart 44 affords objec-
tive evidence that the old-type examinations for second-, third- and fourth-
year classes are at least very nearly equal, if not in fact slightly inverted,
in difficulty as far as vocabulary is concerned. That such differences as 
do exist are slight, and do not always tend to make the higher examina- 
tions harder than the lower, is clearly indicated in Chart 45. 

Old-type second- and fourth-year examinations nearly equal as to vocabulary
difficulty and probably also as to grammar and syntax. — The importance 
of this indication is that the second-, third- and fourth-year examinations 
are very probably as nearly equal in respect to grammar, idiom and syn- 
tax difficulties as in respect to vocabulary difficulties. This inference 
seems justified for at least two reasons. In the first place, the method 
of selecting the grammar, idiom and syntax elements for the old-type 
examinations is just as subjective as the method of selecting the vocabu- 
lary content, and there is no reason for supposing that the modern lan- 
guage scholars who make up the old-type examinations would make more 
errors in judging the difficulty and measurement value of vocabulary ele- 
ments than of grammar and idiom elements. In the second place, even 
if the actual rules of grammar and syntax and idioms in the elementary 
examinations are easier than in the advanced, their greater easiness might 
be effectually masked by the strange and unknown words which are their 
vehicles. It is certainly conceivable that a student might miss a perfectly 
simple point of grammar not because he is ignorant of it, but because it 
occurs in a context of words which will not be encountered five times in 
400,000 words of French discourse, and which are not common to more 
than three of sixteen widely used textbooks. With such nearly equivalent 
examinations being used for second-, third- and fourth-year classes, we 
can no longer be surprised at the large overlapping of classes set forth 
earlier in this report. Students of only second-year ability in French who 
happen to be in a French IV class have almost as good a chance of passing 
the fourth-year examination as of passing the second-year examination. 
In view of these facts it is idle to suggest that some of the weaknesses of
the old-type examinations set forth in these pages are in reality due to
the invalidity of the new-type test which has been used as a sort of check
on the other type.

¹ A member of the Advisory Board of the Regents Examinations told the writer of this Report that
a few years ago the Board interchanged the names of the third- and fourth-year pa[...]
Committee for one of the modern languages.

[Chart 45: line chart. Examinations arranged by year-class (2nd, 3rd, 4th year; Regents and C.E.E.B.), with the new-type tests at the right. Three curves: words occurring 100 times or more in 400,000, i.e. in the first 322 words; words occurring 500 times or more in 400,000; words occurring less than 5 times in 400,000, i.e. fifth-thousand words.]

CHART 45. Proportions of root words in old- and in new-type French examinations
of indicated Henmon frequencies, showing that there are only small differences between
the vocabulary difficulties of 2nd- and 4th-year old-type examinations, and that all
old-type French examinations here represented contain large proportions of words
that have little measurement value, so far as vocabulary is concerned. The 1925
Regents old-type and both new-type tests are 90-minute examinations; all others are
3-hour examinations. This chart is made from the data of Tables 38 and 39, q.v.

Proportions of high- and low-frequency words in Regents and College 
Entrance Examination Board old-type examinations and in Regents and 
junior high school new-type tests. — In Chart 45 the old-type examinations 
are arranged in ascending order of year-classes from left to right, with 
the new-type French examinations at the extreme right. The dash-lines 
connecting the points of the new-type with the curves of the old-type are 
put in merely to identify the corresponding frequency groups; they should 
not be thought of as continuations of the curves. Both the general con- 
tours and positions, and the jagged nature of the curves, show that chance 
had considerable influence on the character of the vocabulary samplings 
of the old-type examinations. 

The lowest of the three curves shows the proportion in each of the
examinations of words which occurred less than five times in 400,000 
words of discourse. These are words from the fifth thousand, or higher 
thousands, of French words according to frequency of use in general 
French reading matter. Such rare words certainly have a place in fourth- 
year examinations, but they can have very little measurement value, if any, 
in first- and second-year examinations, and they are of doubtful value in a 
third-year high school examination. We should therefore expect that 
there would be very few of these words in second-year examinations and 
a moderate number in fourth-year examinations; in other words, that the 
lowest curve would ascend from near the base line at the left to about 
10% at the right. As a matter of fact, however, the curve, in spite of 
irregularities, is practically flat in general trend. About 11% of the 
words in two of the four old-type third-year examinations represented in 
the chart, are in this category; in only one of the three fourth-year exami- 
nations is the proportion of such words as great as 11%, the proportions 
in the other two being 8 and 9 per cent. Two of the four second-year 
examinations have proportions of 9% of such words. In other words, 
we are asking second-year students to know as many fifth-thousand 
words as we ask fourth-year students to know, and are also asking second-
year students to make their knowledge of grammar, idioms, syntax, etc.,
shine through the mists of these rare words, or remain invisible, unhonored 
and unsung, but not unwept. 

The middle curve shows the proportion of words in the old- and the 
new-type tests from the group of 59 words which occur 500 times or oftener 
in 400,000 words of running discourse. Words from this group might 
logically be expected in first- and second-year old-type examinations but 
they probably have little, if any, measurement value for students above 
the second-year level. We should therefore expect the middle curve to 
slant downward to the right since the old-type examinations are arranged 
in order of year-classes from left to right. The examination which shows
the highest proportion of such easy words, however, is a fourth-year 
examination and the second-highest proportion is shown by a third-year 
examination. One of the second-year examinations is among the four 
old-type papers that show the lowest proportions of such easy words. 

The topmost curve in Chart 45 shows for each of the examinations the
proportions of words that occur 100 times or oftener in 400,000 words of 
running discourse. According to the Henmon Word Book there are only 
322 different French root words that occur 100 times or oftener. We 
would expect to find larger proportions of the first 300 words in the 
second-year examinations than in the fourth-year examinations, that is, 
if the examinations were constructed in the best possible manner we would 
expect the topmost curve in Chart 45 to slant downward to the right. 
But here again chance seems to have had as much influence as in the 
choice of the very difficult words. Forty-nine per cent of the different 
root words in the 1924 College Board second-year French examination 
are from the first 322 words according to the Henmon list, and corre- 
sponding per cents for the two fourth-year College Board papers are 47 
and 48. The examination which has the lowest proportion of the first 
322 words is the College Board third-year paper of 1924. 

The proportions of easy and difficult words of new-type tests were also 
determined on the basis of objective evidence. The proportion of fifth- 
thousand words for the Regents new-type examination is 9% and for the 
junior high school 12%. The larger proportion of rare words in the junior 
high school test is due to the fact that the alternative answers in the 
multiple choice form of reading test were purposely allowed to include, 
in a few cases, unusual words. Twelve per cent of the words in the new- 
type tests are from the group of 59 words which occur 500 times or oftener 
in 400,000 words of running discourse, and 32% and 37% of the new-type 
test words are from the first 322 words according to the Henmon French 
Word Book. In an examination which is designed to cover the whole 
range of achievement in high school modern language work the propor- 
tions for the new-type tests seem entirely reasonable. 

Chart 46 is parallel to Chart 45 in general form, and shows the propor- 
tions for each of the old- and new-type examinations of words which are 
common to fewer than four of sixteen textbooks, words which are common 
to 14, 15 and 16 textbooks, and words which are common to 16 textbooks. 
The lowest curve in Chart 46 shows that there is practically no difference 
between third- and fourth-year old-type examinations and only negligible 
differences between these and the second-year old-type examinations 
with respect to proportions of words common to three, or fewer, of sixteen 
textbooks. Two of the old-type examinations show proportions of such 
words as high as 17%; one of these is a third-year and the other a fourth-
year paper. Two of the four second-year old-type examinations have
proportions of very rare words of 13%; one third-year and two of the fourth- 
year old-type examinations show proportions of 13%. It is interesting 
to note the differences between examinations for the same year-class in 
different years. In the second-year Regents examination of 1924, 13% 
of the words were common to fewer than four of sixteen textbooks and in 
the 1925 second-year examination only 7% of the words fell in this cate- 
gory. These facts go far towards accounting for the variations in stand- 
ards of the Regents and College Board examinations and also in account- 
ing for the tremendous overlapping of classes. More than one in six of 
the different root words in the 1924 Regents French examination for the 
third-year are words which are common to fewer than four of sixteen 
widely used textbooks; in two of the three fourth-year examinations 
represented in Chart 46 only about one in eight of the different words 
are in this category. We can no longer wonder at the fact that students 
sometimes pass a fourth-year examination on the same day that they 
fail on a third-year examination. That the examinations used by the 
highest and most universally recognized standardizing agencies in the 
country should show such chance variations as these is a notable com- 
mentary on the extent to which American educators have underestimated 
both the difficulty and the importance of constructing sound measuring 
devices. 

The middle curve of Chart 46 shows for each of the examinations the 
proportions of words common to sixteen textbooks, that is, of the first 
134 words according to the Wood French word list. The examination 
which shows the highest proportion of these very easy words is the fourth- 
year Regents examination of June, 1925. The examinations which con- 
tain the lowest proportions of such words are the fourth-year papers of 
the College Board for 1924 and 1925. The Regents 1924 second-year 
paper shows a proportion of these words as low as 25%, which is the size 
of the proportion of the third-year papers of the Regents of 1924 and of 
the College Board of 1925. 

The topmost curve in Chart 46 shows the proportions for each examina- 
tion of words common to 14, 15 and 16 textbooks. According to the 
Wood French word list there are less than 400 words in this category. 
Again we would expect that the fourth-year examinations would contain 
smaller proportions of these words than second-year examinations con- 
tain. The curve does show a slight tendency to slant downward to the 
right but it also shows notable variations. Fifty-nine per cent of the
words in the Regents fourth-year paper of June, 1925, and 52% of the 
words in the 1925 College Board fourth-year paper, fall in this category. 
In other words, the 1925 fourth-year College Board French examination 
contains a larger proportion of the first 400 words, according to the Wood 
CHART 46. Proportions of root words in old- and in new-type French examina-
tions common to indicated numbers of textbooks. This chart is similar to Chart 45, q.v.


French word list, than did the College Board 1924 second-year French 
examination. It would be superfluous to comment further on these 
variations. The proportions of words from these three categories in the 
new-type French examinations are such as would be expected in a test 
designed to cover the whole range of achievement in modern language 
work. Only about 8% of the different root words in the new-type papers 
are common to fewer than four of sixteen widely used textbooks. In 
other words, all of the students for whom the new-type tests were designed
may reasonably be expected to have been exposed to at least 90% of the 
root words in them. It would certainly be unfair to test students below 
the fourth-year level with examinations containing a larger proportion 
of words not found in three out of four of the textbooks used in French 
classes. Less than a fifth of the words in the new-type tests are from 
the first 134 words, that is, from those common to 16 textbooks; and 
about two-fifths of them are from the first 400 words, according to the 
Wood French word list. A proportion of 20% of these words in an exami- 
nation designed for the whole range of high school modern language 
achievement seems much more reasonable than a proportion of 33% of 
such words in an examination designed solely for fourth-year students, 
such as the 1925 Regents examination in French. 

Empirical difficulties of individual questions in new-type tests. — The 
empirical difficulties of each individual question in the new-type examina- 
tions used in this experiment have been ascertained on the basis of the 
per cents of correct responses from random groups of students ranging in 
numbers from 200 to 1000. The first column of figures at the left of the 
questions in the new-type examinations reproduced above on pages 210 
to 269 shows the per cents of correct responses given to each of the 275 
questions in each examination. The second column shows the order of 
difficulty of the items within each part, and the third column shows the 
order of difficulty of all the 275 questions in each test. Table 40 summa- 
rizes the facts of column 1 for all four of the new-type tests used in the 
Regents experiment. (Cf. p. 209 above.) 
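
The two quantities described above, the per cent of correct responses to each question and the resulting order of difficulty, can be computed as in the following sketch. The data layout (one list of true/false marks per student) and the function names are assumptions made for illustration; it is also assumed that rank 1 denotes the easiest item, which the text does not state.

```python
def percent_correct(responses):
    """Per cent of students answering each item correctly.

    `responses` holds one list per student, with one True/False mark per item.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        100 * sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]


def difficulty_order(percents):
    """Rank the items 1..n, easiest (highest per cent correct) first.

    Ties keep the order in which the items appear (sorted() is stable).
    """
    ranked = sorted(range(len(percents)), key=lambda i: -percents[i])
    order = [0] * len(percents)
    for rank, item_index in enumerate(ranked, start=1):
        order[item_index] = rank
    return order


# Four hypothetical students, three items.
marks = [
    [True, False, True],
    [True, True, False],
    [True, False, False],
    [True, True, False],
]
item_percents = percent_correct(marks)
item_order = difficulty_order(item_percents)
```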

It is apparent from Table 40 that the questions in each part of each 
examination are well distributed with respect to difficulty. Some of the 
items are so easy that 90 to 100% of the students answered them correctly, 
and some are so difficult that correct answers were given to them by less 
than 10% of the students. The sampling of students whose responses 
were used to determine the difficulties of these items includes approxi-
mately equal numbers of second- and of third-year students, and all of
the fourth-year students available except in the case of French. The 
overlapping of classes, however, makes the distributions as to year classes 
of students used for this purpose relatively unimportant. 

Correlation of empirical difficulties of words with Henmon and Wood 
frequencies. — There may be some who would question the significance of 
the Henmon and Wood French word list frequencies for examination 
purposes. The correlations of the empirical difficulties of the vocabulary 
items with the Henmon and Wood frequencies show conclusively that 
the Henmon and the Wood frequencies (appearing in columns 6 and 7 
at the left of Part I of the new-type French examination) are very signifi- 
cant for examination purposes. The empirical difficulties of the French 
vocabulary items are based on the responses of 1000 students and the
correlation of these difficulty ratings with Henmon frequencies is 0.50 
and with Wood frequencies 0.60. The correlation between the Henmon 
and the Wood frequencies for these 100 words is 0.86, and the correla- 
tion between Henmon and Wood frequencies for the 2683 words in the 
Wood list is 0.70. 
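
A correlation of this kind can be computed as sketched below. The text does not say which coefficient was used; the product-moment (Pearson) coefficient is assumed here, and the sample figures are invented for illustration and are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical per cents of correct responses for five vocabulary items,
# paired with the number of textbooks (of sixteen) containing each word.
per_cent_correct = [95.0, 80.0, 62.0, 40.0, 15.0]
textbooks_containing = [16, 15, 12, 9, 3]
r_items = pearson_r(per_cent_correct, textbooks_containing)
```

With these invented figures the commoner words are also the easier ones, so the coefficient is strongly positive; the coefficients of 0.50 and 0.60 reported above indicate a weaker but still substantial relation in the real data.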


TABLE 40


EMPIRICAL DIFFICULTIES OF INDIVIDUAL NEW-TYPE QUESTIONS. SHOWING DISTRI-
BUTIONS OF QUESTIONS IN EACH PART OF EACH NEW-TYPE EXAMINATION ACCORDING
TO PER CENTS OF RANDOM SAMPLINGS OF STUDENTS GIVING CORRECT ANSWERS. THE
FIRST COLUMN UNDER THE HEADING “FRENCH” SHOWS THAT THERE WAS ONE VOCABU-
LARY ITEM ANSWERED CORRECTLY BY 100% OF THE STUDENTS; 29 VOCABULARY ITEMS
THAT WERE ANSWERED CORRECTLY BY 90 TO 99% OF THE STUDENTS; 13 VOCABULARY
ITEMS ANSWERED CORRECTLY BY 80 TO 89% OF THE STUDENTS; AND 2 VOCABULARY
ITEMS ANSWERED BY 10% TO 19% OF THE STUDENTS. THE SAMPLINGS OF STUDENTS
ON WHICH THE PER CENTS ARE BASED INCLUDED STUDENTS FROM ALL THREE YEAR-
CLASSES IN THE LANGUAGES; THE SIZES OF THE SAMPLINGS WERE AS FOLLOWS: FRENCH,
1000; SPANISH, 400; GERMAN, 300; PHYSICS, 200.


PER CENTS               FRENCH                    SPANISH                   GERMAN
OF STUDENTS
GIVING                                                                                       PHYSICS
CORRECT          I       II      III        I       II      III        I       II      III
RESPONSES      Vocab.  Read.   Gram.     Vocab.  Read.   Gram.     Vocab.  Read.   Gram.
100      1  [?]  2  3
90-99    29  34  10  37  36  11  49  29  8  [?]
80-89    13  20  11  25  21  8  21  18  [?]  23
70-79    8  8  22  15  6  16  14  12  15  22
60-69    3  8  [?]  [?]  5  [?]  10  6  13  24
50-59    14  1  12  4  3  [?]  5  4  [?]  32
40-49    10  2  [?]  [?]  2  6  2  15  14
30-39    6  1  8  13  [?]  14
20-29    4  5  2  15  1  7  6
10-19    2  1  6  1  5  1  8  2
0-9      4  2  1


Reliability of orders of difficulty. — It is a generally accepted principle 
that in so far as possible the items or questions in an examination should 
be arranged in the order of difficulty. If the items are arranged in an 
order of difficulty, then, other things being equal, each student will be 
more likely than otherwise to answer all of the questions that he could 
answer within the time allowed. If the questions are not arranged in an 
order of difficulty, students may fail to answer some questions that they 
could answer correctly because they have wasted time trying to figure out 
difficult items that are beyond their powers and that came early in the 
test. It was for this reason, among others, that care was taken in the 
construction of the new-type tests to arrange the items within each part 
in the approximate order of their empirical difficulties. The reader will 
notice that while in general the items that come first in each part are
easier than the items which come later, according to the figures in col- 
umn 1, there are many small, and some large, variations or disagreements 
between the order of difficulty and the order of appearance of the items 
in the tests as printed. These disagreements are due partly to mechanical 
errors in the set-up of the examinations for printing, but mainly to the 
unreliability of the order of difficulty of the items as initially established 
by experimental methods. The orders of difficulty in which the items 
were actually arranged in the Regents new-type examinations were in 
some cases based on the responses of less than 200 students. The disagree- 
ments between the orders of difficulty in column 2 and the orders in 
which the items were actually printed, show that the original determina- 
tions of empirical difficulties were not as reliable as had been hoped. 
It therefore becomes important to know more exactly than we know at 
present how many cases are required in order to establish acceptable and 
reliable orders of difficulty. It was for this purpose that the correlations 
of Table 41 were calculated. 


TABLE 41


RELIABILITIES OF ORDERS OF DIFFICULTY. AVERAGES OF CORRELATIONS BETWEEN
EMPIRICAL DIFFICULTIES OF THE ITEMS IN EACH PART OF THE NEW-TYPE REGENTS EXAMI-
NATION IN FRENCH OF JUNE, 1925, BASED ON PER CENTS OF CORRECT RESPONSES IN
GROUPS OF 100, 200, 300 AND 500 STUDENTS.


Regents New-Type Examination in French of June, 1925

    N       PART I     PART II    PART III    No. of correlations averaged
                                                    (each part)
   100       .838       .896       .909                 3
   200       .905       .958       .940                 2
   300       .931       .954       .963                 1
   500       .936       .964       .975                 1


Table 41 shows that the reliabilities of orders of difficulty vary both 
with respect to numbers of students on whose responses the orders were 
based and with respect to the character or form of the questions. Thus 
the reliability of an order of difficulty based on the responses of 100 stu- 
dents is 0.838 for the vocabulary items of Part I of the French test, 0.896 
for the true-false reading items of Part II and 0.909 for the grammar-
completion items of Part III. The reliabilities of the orders for all three
parts increase gradually and fairly regularly as we increase the numbers 
of students on whose responses the orders were based, but the differences 
between the parts are not eliminated by increase in the number of cases 
up to 500. The reliability of an order of difficulty based on 500 cases is 
0.936 for the 100 items of Part I, 0.964 for the 75 items of Part II, and 
0.975 for the 100 items of Part III. The general conclusion is that orders 
of difficulty based on responses of 500 students are sufficiently reliable
for all practical purposes, although larger numbers of cases should be used 
for vocabulary items wherever possible. 
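
One way of obtaining reliabilities such as those of Table 41 is sketched below: the per cents of correct responses are computed separately in each of several groups of students of the same size, the item difficulties of every pair of groups are correlated, and the correlations are averaged. The pairing of groups and the use of the product-moment coefficient are assumptions; the text states only that averages of correlations between groups were taken. All names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def item_percents(group):
    """Per cent correct on each item within one group of students."""
    n = len(group)
    return [100 * sum(s[i] for s in group) / n for i in range(len(group[0]))]


def pearson_r(xs, ys):
    """Product-moment correlation (assumes neither list is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def order_reliability(groups):
    """Average correlation of item difficulties over all pairs of groups."""
    percents = [item_percents(g) for g in groups]
    rs = [pearson_r(a, b) for a, b in combinations(percents, 2)]
    return sum(rs) / len(rs)


# Three tiny hypothetical groups answering the same three items (1 = correct).
group_a = [[1, 1, 0], [1, 0, 0]]
group_b = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
group_c = [[1, 0, 0], [1, 1, 1]]
reliability_ab = order_reliability([group_a, group_b])
reliability_abc = order_reliability([group_a, group_b, group_c])
```

In practice the groups would contain 100 to 500 students each, as in Table 41; groups of two to four are used here only to keep the example short.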

Validity of individual new-type questions. — The validity of any ques- 
tion is measured by the degree to which it correlates with acceptable 
criteria of achievement. If poor students answer a question correctly as 
often as, or oftener than, good students answer it, the question is obvi- 
ously not valid. In other words if a question uniformly gives more 
credit points to second-year students than it gives to fourth-year stu- 
dents, it hinders rather than promotes the purposes of the examination. 
The fourth column of figures at the left of the new-type examinations 
reproduced above on pages 209 to 282 shows for each question the differ- 
ence between the per cent of fourth-year students and the per cent of 
second-year students that answered the question correctly. 
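
The difference described above is simple to compute. The following minimal sketch in Python shows the arithmetic; the function names and the student counts are hypothetical, while the per cents for Items 75 and 99 of Part I of the French test are those given in the caption of Chart 47.

```python
def percent_correct(num_correct, num_students):
    """Per cent of a group answering a question correctly."""
    return 100.0 * num_correct / num_students


def validity_difference(pct_upper, pct_lower):
    """Per cent correct in the upper group minus per cent correct in the lower group."""
    return pct_upper - pct_lower


def direction(difference):
    """Label a difference as positive, zero or negative."""
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "zero"


# Hypothetical counts: 150 of 200 fourth-year and 90 of 300 second-year students correct.
upper = percent_correct(150, 200)
lower = percent_correct(90, 300)
example = validity_difference(upper, lower)

# Per cents reported under Chart 47 (French IV minus French II).
item_75 = validity_difference(79, 31)   # a good question
item_99 = validity_difference(82, 86)   # a bad question
```

A question such as Item 99, for which `direction` returns "negative", gives more credit to the lower group than to the upper group.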

Most of these differences are positive, showing that larger proportions 
of fourth-year, or highest-quarter, students answered the questions cor- 
rectly than of second-year, or lowest-quarter, students. A few of the 
differences are negative showing that larger proportions of second-year, or 
lowest-quarter, students answered the questions correctly than of fourth- 
year, or highest-quarter, students. Earlier in this report it was stated 
that no items were used which did not give positive correlation with accept- 
able criteria of achievement. Table 42, which summarizes the facts pre- 
sented in column 4, shows that 23, or a little over 2%, of the 969 questions 
in the four new-type examinations used in this experiment show negative 
differences; that is, give more credits to poor students than to good students. 

The inclusion of questions showing negative differences was due in part 
to those accidents and mechanical mistakes which seem inevitable in 
handling large masses of data, and in part to the unreliability of the 
original determinations of the validities of the items used. The number 
of “bad”² questions is so small in relation to the total number of questions
in the examinations that they had only a negligible influence on the
scores of the students. Thus there were only 9 questions of 275 in the
German examination showing zero or negative differences; and an equal
proportion in the French examination. There were only 3 such questions
in the 275 items in the Spanish examination and only two in the 144
items in the physics test.

2 Cf. p. 87 above for definitions of “good” and “bad” questions. The discussion here closely
parallels the treatment of the validity of individual questions in the junior high school
French and Spanish tests, pp. 87 to 91, above.

As nearly 98% of all the items of the four


TABLE 42 


VALIDITIES OF INDIVIDUAL NEW-TYPE QUESTIONS. SHOWING DISTRIBUTIONS OF 
QUESTIONS IN THE FRENCH, SPANISH AND GERMAN EXAMINATIONS ACCORDING TO DIF- 
FERENCES BETWEEN PER CENTS OF FOURTH-YEAR AND PER CENTS OF SECOND-YEAR STU- 
DENTS ANSWERING CORRECTLY, AND IN THE PHYSICS TEST ACCORDING TO DIFFERENCES 
BETWEEN PER CENTS OF HIGHEST-QUARTER AND PER CENTS OF LOWEST-QUARTER STU- 
DENTS ANSWERING CORRECTLY. THE COLUMN AT THE LEFT INDICATES DIFFERENCES 
BETWEEN PER CENTS OF STUDENTS IN THE FOURTH YEAR MODERN LANGUAGE CLASSES 
ANSWERING CORRECTLY AND PER CENTS OF STUDENTS IN THE SECOND-YEAR CLASSES 
ANSWERING CORRECTLY, AND BETWEEN PER CENTS OF HIGHEST-QUARTER PHYSICS STU- 
DENTS ANSWERING CORRECTLY AND PER CENTS OF LOWEST-QUARTER PHYSICS STUDENTS 
ANSWERING CORRECTLY. THE NUMBERS OF CASES ON WHICH THESE DISTRIBUTIONS
ARE BASED ARE THE SAME AS IN TABLE 40.


% CORRECT ANSWERS          FRENCH                    SPANISH                    GERMAN            PHYSICS
IN 4TH YEAR MINUS       I       II      III       I       II      III       I       II      III
2ND YEAR CLASS        Vocab.  Read.   Gram.     Vocab.  Read.   Gram.     Vocab.  Read.   Gram.
90-99 1 
80 1 1 
70 3 4 1 
60 9 1 u 4 1 5 1 
50 14 2 6 8 5 6 2 1 6 
40 15 9 20 13 3 24 6 4 5 21 
30 14 8 21 16 12 24 18 12 17 32 
20 11 20 18 21 10 22 16 vf 30 37
10 15 14 12 25 3 13 13 21 28 39
5-9 6 11 5 8 10 3 14 15 ile 6 
1-4 9 6 5 2 2 23 10 3 1 
0 1 1 2 1 
-1-4 1 1 1 2 2 1
-5-9 1 1 1 2 1
-10 1 1
-20 1
-30 1


examinations operated in the right direction and more than 82% of all 
the questions show positive differences of 10 or more, the percentage of 
efficiency is fairly acceptable. 
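
The proportions quoted in this paragraph follow from the counts given in the text. The sketch below, in Python, repeats the arithmetic; the counts of questions with zero or negative differences are those stated above (with 9 taken for French, as the "equal proportion" of 275 items), and the variable names are illustrative only.

```python
# Questions showing zero or negative differences, and total items, per examination.
zero_or_negative = {"French": 9, "Spanish": 3, "German": 9, "Physics": 2}
items = {"French": 275, "Spanish": 275, "German": 275, "Physics": 144}

total_bad = sum(zero_or_negative.values())
total_items = sum(items.values())

# Shares expressed as per cents of all items in the four examinations.
share_bad = 100.0 * total_bad / total_items
share_right_direction = 100.0 - share_bad

print(f"{total_bad} of {total_items} items: {share_bad:.1f}% zero or negative, "
      f"{share_right_direction:.1f}% in the right direction")
```

The totals agree with the 23 questions and 969 items cited earlier, and the second share is the "nearly 98%" of the text.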

Need for empirical verification of measurement values of individual ques-
tions. — The data on “goodness”¹ or validity of individual questions
afford striking evidence of the need for great care in the construction of
examinations which are designed to give accurate and meaningful informa-
tion about defined achievements of students. That a few “bad” ques-
tions were included even in these new-type tests, in spite of consistent
appeal to objective experimental evidence on their difficulties and validi-
ties, indicates that in depending on a committee of three or four notable
scholars to produce in one afternoon an examination which is to serve as
a state-wide standardizing instrument, we have seriously underestimated
the task.

1 See above pp. 87-91.
It should be stated that our treatment of the validity of individual
questions is necessarily very incomplete in this report. In order that a 
question be really good it must not only differentiate between second-year 
and fourth-year classes, or between lowest- and highest-quarter students; 
it must progressively give higher and higher per cents of correct responses 
as we go from first- to fourth-year classes, or from lowest- to highest- 
quarters of student groups, whether in the same or in different year- 
classes. In other words, the per cents of correct answers should increase 
from lowest to highest classes as indicated in the curves at the top of 
Chart 47, without irregularities or inversions at any point. 
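
The stricter requirement stated here, that per cents of correct answers rise at every step, can be checked mechanically. The following is a minimal sketch in Python; the function names are illustrative, ties are treated as irregularities (an assumption), the first three series are the per cents given in the caption of Chart 47, and the last series is hypothetical.

```python
def is_progressive(pcts):
    """True if every per cent is higher than the one for the class below it."""
    return all(later > earlier for earlier, later in zip(pcts, pcts[1:]))


def inversions(pcts):
    """Positions (0-based pairs) at which the per cent fails to rise."""
    return [(i, i + 1) for i in range(len(pcts) - 1) if pcts[i + 1] <= pcts[i]]


# Per cents correct reported under Chart 47, lowest class or quarter first.
french_part1_item75 = [31, 72, 79]     # French II, III, IV
french_part1_item99 = [86, 84, 82]     # French II, III, IV
physics_item110 = [29, 62, 80, 89]     # lowest to highest quarter

# Hypothetical question: positive from second to fourth year,
# but with an inversion between the second and third years.
hypothetical = [40, 35, 60]
```

The hypothetical series shows why a positive difference between second- and fourth-year classes is not sufficient by itself: the end points rise, yet `inversions` reports a failure between the first two classes.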

The vast majority of the questions in the new-type tests do give curves 
like those at the top of Chart 47; but, in addition to the 23 questions 
already mentioned which give curves like those at the bottom of this 
chart, there are a few which, in spite of showing positive differences as 
between second- and fourth-year classes, show negative differences as 
between second- and third-, or between third- and fourth-year classes. 
At first thought it seems impossible that any reasonable question could 
be framed which would be correctly answered by mediocre oftener than 
by superior students; yet many questions of both the old and the new 
types behave in this curious manner. Very little is known as to the 
precise cause or causes of such anomalies, but they undoubtedly lie in 
subtle ambiguities in the form, content or context of the questions, — 
ambiguities which mislead superior students but which are too fine to 
be noticed by students who are still in the elementary stages or are dull 
or both. It is to enable those interested to study the characteristics of 
both good and bad questions, with a view to learning how to cultivate 
the former and avoid the latter, that we have indicated the ‘order of 
goodness” of all the questions in each of the new-type tests used in this 
experiment. This order of goodness is indicated in the fifth column 
of figures at the left of the questions in the tests reproduced on pages 
210 to 282. 

The reader may find considerable amusement, and perhaps profit, in
indulging in a guessing game with the figures in the third and fifth columns 
at the left of the questions in the new-type tests. He may, for example, 
learn something of his ability to judge the difficulty of questions by se-
lecting any two items in one of the tests, deciding which is the more
difficult, and comparing the decision with the figures in column 3. Or
he may guess at the relative differentiating powers, i.e. validities, of two 
questions and compare his judgments with the figures in column 5. 

The easiest of the 275 items in the new-type French test is faire, No. 2
in Part I, and the hardest is No. 100 in Part III; the best question is
No. 47 in Part III, and the worst question is No. 72 in Part II. A study
of some of the bad questions will show obvious defects in them. For
example, No. 79 in Part I, sol (= soil), contains two very tricky alterna-
tives, sun and alone. Number 81 in Part III shows a negative difference
which is probably due to a slight inadequacy in the scoring key. There
are several other questions showing defects that are more or less obvious
at the present writing but which escaped detection before the tests were
printed. For obvious reasons, the tests are reproduced on pages 210 to
282 exactly as they were presented to the students in June, 1925, as parts
of the Regents examinations. The identification of the easiest and hard-
est, best and poorest questions in the Spanish, the German and the Physics
tests is left to the interest of the reader. (Cf. p. 209 above.)

[Chart 47: per cents of correct answers plotted against year-classes for selected questions in each test; good questions in the upper row, bad questions in the lower row.]

CHART 47. — Illustrating graphically the validity of individual questions in the new-
type examinations in French, Spanish, German and Physics. The graphs at the top
of the chart relate to the good questions and those at the bottom to bad questions. The
chart is to be read as follows: Item 75, in Part I of the new-type French examination,
was answered correctly by 31% of French II students, by 72% of French III students,
and by 79% of French IV students. The corresponding per cents for Item 99 in Part I
of the French test are 86, 84 and 82%. The graphs for the questions in the other modern
language tests are to be read in the same way. The graphs at the extreme right show that
question 110 of the physics examination was answered correctly by 29% of the lowest-
quarter students, by 62% of the second lowest-quarter students, by 80% of the third-
quarter students, and by 89% of the highest-quarter students; and that question 61 was
answered correctly by 62% of the lowest-quarter students and by 55% of the highest-
quarter students.

Validity of individual old-type questions. — The three most heavily
weighted questions in the old-type modern language examinations are
(1) the foreign language-to-English translation, (2) the English-to-foreign
language translation, and (3) the free composition questions. The inter-
correlations of these three questions in the French II and Spanish II
examinations, and also their correlations with total new-type scores, are
shown in Table 43.

The English-to-foreign language translation question gives the highest
correlation with new-type scores in both the French II and the Spanish II
examinations; the lowest correlation with new-type scores is given by the
free composition question in French II and by the Spanish-to-English
translation question in the Spanish II examination. The intercorrela-
tions of the three French II questions average about 0.54 and of the
three Spanish II questions about 0.37. When we recall that these three
questions in the old-type papers account for about two-thirds of each old-
type examination, their intercorrelations seem much too low. It seems
probable to us that while the lowness of the relationships is partly due


TABLE 43 


CORRELATIONS OF TOTAL NEW-TYPE EXAMINATION SCORES WITH THE OLD-TYPE
TRANSLATION AND FREE COMPOSITION QUESTIONS, AND INTERCORRELATIONS, MEANS AND
SIGMAS OF THE LATTER, BASED ON RANDOM SAMPLINGS OF FRENCH II AND SPANISH II
STUDENTS WHO TOOK THE REGENTS EXAMINATIONS IN JUNE, 1925. THE CORRELATIONS
IN COLUMN (2) SHOW THAT THE ENGLISH-TO-FOREIGN LANGUAGE TRANSLATION QUES-
TIONS ARE THE MOST VALID OF THE THREE OLD-TYPE QUESTIONS IN BOTH LANGUAGES;
BUT THE LARGEST NUMBERS OF CREDIT POINTS ARE ASSIGNED TO THE TRANSLATIONS
IN THE OPPOSITE ORDER. THE FREE COMPOSITION QUESTIONS ARE THE MOST HEAVILY
WEIGHTED, AS SHOWN BY THEIR GREATER SIGMAS, IN COLUMN (6).


(1)                                            (2)        (3)          (4)          (5)      (6)      (7)
OLD-TYPE FRENCH II                           NEW-TYPE   FRENCH-TO-   ENGLISH-TO-   OLD-TYPE QUESTIONS  NUMBER
EXAMINATION QUESTIONS                        FRENCH     ENGLISH      FRENCH        Mean     Sigma      OF
                                             EXAM.                                                     CASES

French-to-English translation (25 credits)    0.586                                19.94    3.32       1110
English-to-French translation (20 credits)    0.600      0.508                     12.03    3.61       1110
Free composition (or oral credit)
  (20 credits)                                OACON TO L542 OSS ale te Ont                  4.15       1110

OLD-TYPE SPANISH II                          NEW-TYPE   SPANISH-TO-  ENGLISH-TO-   OLD-TYPE QUESTIONS  NUMBER
EXAMINATION QUESTIONS                        SPANISH    ENGLISH      SPANISH       Mean     Sigma      OF
                                             EXAM.                                                     CASES

Spanish-to-English translation (25 credits)   0.464                                20.55    2.4        971
English-to-Spanish translation (20 credits)   0.614      0.3868                    13.56    2.8        971
Free composition (or oral credit)
  (20 credits)                                052380 0: 830N 0422                  15.07    3.0        971


to the independence of the three functions tested by the three questions, 
it is largely due to the unreliability of each of the three old-type question 
forms. This unreliability is due primarily to the inadequate sampling 
of modern language materials and of student performances in such old- 
type questions, and partly to the subjectivity in the scoring of the stu- 
dents’ responses. We have shown in preceding tables and charts that the 
vocabulary samplings of the old-type examinations are not only meagre
but are very heterogeneous with respect to difficulty or frequency of
occurrence. The presence of only three or four fifth-thousand words in 
a passage which second-year students are asked to translate might easily 
cause an excellent French II student to make a very poor showing — 
especially when the scoring is subjective. 

General weaknesses of translation and free composition questions. — The 
translation and free composition questions in the old-type examinations 
suffered also because of the general nature and indefiniteness of the abili- 
ties and skills which they measure. The translation questions in both 
directions require the ability to read both languages as well as to write 
in both languages. Therefore, in translating a foreign passage into 
English, the failure of the student may be due more to poor expression 
in English than to lack of understanding of the foreign passage, and failure 
to translate an English passage into the foreign language may, in multi- 
lingual communities, be due as much to difficulty in understanding the 
English passage as to inability to express ideas in the foreign language. 
The free composition question tests not only ability to express ideas in 
terms of the foreign language but several other more or less specific abili-
ties which we may designate by the term usually employed, “imagina-
tion.” Some modern language scholars hold that the complexity of
abilities implied in old-type examinations is the prime merit of old-type 
forms of questions. That these abilities are all exceedingly important is 
undeniable, but it is a serious question whether it is the function of mod- 
ern language examinations to measure them. The task of measuring 
simple and unadorned knowledge of the foreign language itself is a large 
enough task for any single examination; it is hardly wise or reasonable to 
expect modern language examinations, which are known to be imperfect 
in measuring modern language ability, to measure also general intelli- 
gence, imagination and spiritual well-being. These qualities are just as 
significant for other departments of instruction as for modern languages; 
moreover, they are important enough to merit separate and very elaborate 
examinations. The attempt to measure these general qualities in terms 
of specific subject matter examinations is the main cause of the general 
meaninglessness of most school marks. Some French teachers may feel 
entirely justified in encouraging a gifted and imaginative youth by giving 
him a high grade in French in spite of poor achievement in mastering the 
language itself; or in flunking a youth who knows the language well but 
who lacks imaginative fire and aesthetic perspective; but the main effect 
of such confusion of meaning of grades is to make the intelligent educa- 
tional guidance of students almost impossible. The fundamental purpose 
of school grades is to convey in universally meaningful terms, accurate 
information concerning defined achievements and traits. Examinations 
given in elementary modern language courses should confine themselves
to the task of describing accurately and meaningfully the student’s mas-
tery of the language itself; to attempt to go further than this in such
examinations is an ambitious assumption of a duty which belongs to 
other parts of the educational system of the state. 

Fallacy of substituting oral credit for free composition questions. — A
more specific example of the indefinite meaning of grades based on old- 
type examinations is found in the dual character of the free composition 
question. According to the Regents rules, oral credit, as determined by 
the local schools, may be substituted for the full value of the free composi- 
tion question. Out of a little over 1100 old-type French II papers used 
in deriving the figures of Table 43 more than 600 showed that oral credit 
of from 16 to 19 points had been substituted for the possible 20 points 
of credit on the free composition question. No paper was found in the 
1100 in which the full 20 points of possible credit on the free composition
question was given for oral work. 

It is well known that the oral and aural skills may be relatively inde- 
pendent of achievement in the written language. It is often asserted by 
those entitled to an opinion that the oral-aural work has materially im- 
proved in New York State and elsewhere since the introduction of oral and 
aural examinations. A student might be able to write a fairly respectable 
composition in French of 150 or 200 words without being able to pronounce 
correctly all or a majority of the words that he has written. In determining 
oral-aural credit some teachers base their marks almost wholly on mechani- 
cal perfection of pronunciation, while others put considerable emphasis 
on content in the student’s foreign language speech. Thus the grade on 
the old-type Regents examination is a variable measure of a variable 
complex of dependent and of independent abilities and skills. Whenever 
oral credit is substituted for the free composition question on a paper, 
it means that the direct application of the Department readers’ standards 
is impossible for a fifth of the total possible credit on a paper. The 
variable emphasis which teachers give to oral work mentioned above is 
responsible to no small extent for the variations in individual school 
standards, which, as described earlier in this report, remain partially 
uncorrected by the Department reviewing. 

Special weaknesses of free composition question. — There is no form of 
question and no aspect of a foreign language for which the direct applica-
tion of centralized reviewing is more needed than for the free composi-
tion question and the oral-aural skills. In reading free compositions and
in judging oral-aural skills the subjective judgment of the rater has a free- 
dom which is practically beyond restriction in the absence of directed 
centralized reviewing. It is unfortunate that the beneficent influence of 
the State Department readers in reducing the variable standards of the 
local schools is least effective in that part of the modern language examina-
tion in which such reviewing is most needed. The freedom of the teacher
to substitute oral credit for the free composition question is probably a
large factor in producing the low correlations for the free composition
question displayed in Table 43.

The special weaknesses of the free composition question are too well 
known to need comment. If the topics assigned are kept within the 
range of what could be reasonably expected of students, such as, “A Day
in the Country,” “A Trip to the Museum,” “A Voyage,” “My Family,”
“My Home,” “The School Room,” etc., the students may be unfairly
coached. On the other hand, if in choosing topics the examiners go
beyond the range of coaching they may easily go beyond the range of
the student’s foreign language vocabulary or beyond his experience, or
beyond both.

It has been argued by some that the new-type tests are of little value 
because they do not measure free composition ability and the oral-aural 
skills. The answer to this is that they do measure these abilities and 
skills indirectly as accurately as, and probably more accurately than, 
the traditional tests measure them. But, admitting the charge that the 
new-type forms of tests do not measure these things, the only conclusion 
that can be drawn is that we must search for means of measuring speech 
and hearing factors as accurately as the new-type tests measure ability 
with the written language. Since the two kinds of abilities are relatively 
independent, our measures of them should be kept separate in any case. 
To declare that the new type is of little value because it does not measure 
oral-aural skills is like throwing away a dollar because it is not two dol-
lars; also, to fail a student in a French course merely because he cannot
pronounce would be just as confusing as to report an over-weight child as
under-weight merely because he is under the average height for his age.¹

1 Cf. pp. 96 to 98, above, on the limitations of the new-type tests.

Irrelevant activities in old- and new-type examinations. — The funda-
mental reason for the inferiority of the old-type examinations is that they
involve so many irrelevant activities that only a fraction of the examina-
tion is really expended in displaying the type of achievement which the
examination is supposed to measure. The fact that everything in the
student’s answers must be written out in legible longhand in itself places
strict and narrow limits on the samplings of modern language materials
which examiners may include in their examinations. Moreover, the
activity of writing in longhand involves many opportunities for purely
mechanical mistakes, and is a potent distraction to the student which
disturbs him even in those moments when he is actually not moving his
pen. Translation questions are particularly inefficient mirrors of modern
language achievement because in addition to the weaknesses just mentioned
they are not flexible enough to permit of detailed adjustments of size and
character of vocabulary and other samplings and because the activity of
translating itself involves special kinds of skills and abilities not particu- 
larly relevant to genuine modern language teaching goals. Translating 
may be a good form of discipline, but in view of the fact that so few of our 
modern language students become translators, surely more directly useful 
ways of disciplining young minds can be found, especially since so large 
a proportion of modern language students never go beyond the second- 
year class. Translation questions necessarily involve repetitions of words 
and phrases, many of which have little if any measurement value. 

Table 44 shows that an average of about 30% of the different root 
words in both the old- and the new-type examinations are used twice or 
oftener in the same or in inflected forms. In the old-type examinations 
nearly 10% of the different root words are used five times or oftener, 
while only about 7.5% of the different root words in the new-type tests 
are repeated five or more times. Even if the number of repetitions were 
greater in the new-type than in the old-type the waste of time and energy 
would still be greater in the old-type examinations because in them every 
word, whether repeated or not, has to be written out, whereas in the 
new-type they have only to be read. Thus in the old-type three-hour 
examinations the student has to write at least 600 words of running 
discourse in order to expose himself to a sampling of about 225 different 
root words; while in the new-type 90-minute examinations the student 
has to write only 175 digits or symbols and about 125 words in order to 
react to about 500 different words. In some of the old-type examinations 
words like le and de have to be written out 40 or 50 times each, although 
they apparently have real measurement value in only five or six of the 
contexts in which they occur. The wastes of repeating the same words 
and the distraction of writing everything down in longhand are inherent 
in the old-type form of examination. Even with the utmost care, 
examiners will probably never be able to reduce such wastes greatly, 
as long as they confine themselves to old-type question forms. It is 
unnecessary to repeat here that the old-type examinations referred to in 
Table 44 were constructed with extraordinary care by examiners of many 
years’ experience in constructing old-type examinations. 
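
The comparison of writing burden in this paragraph can be put in figures. The sketch below, in Python, uses only the round numbers quoted above (which the text itself gives as approximate) and treats the "different root words" of the old-type papers and the "different words" of the new-type tests as comparable units, which is an assumption; the variable names are illustrative.

```python
# Old-type three-hour examination (figures as quoted in the text).
old_units_written = 600        # words of running discourse, at least
old_words_sampled = 225        # different root words, about
old_minutes = 180

# New-type 90-minute examination.
new_units_written = 175 + 125  # digits or symbols plus words
new_words_sampled = 500        # different words reacted to, about
new_minutes = 90

# Units the student must write for each different word sampled.
old_written_per_word = old_units_written / old_words_sampled
new_written_per_word = new_units_written / new_words_sampled

# Different words sampled per minute of examination time.
old_words_per_minute = old_words_sampled / old_minutes
new_words_per_minute = new_words_sampled / new_minutes

print(f"written per word sampled: old {old_written_per_word:.2f}, new {new_written_per_word:.2f}")
print(f"words sampled per minute: old {old_words_per_minute:.2f}, new {new_words_per_minute:.2f}")
```

On these figures the old-type paper requires several written words for each word it samples, while the new-type paper requires less than one written unit per word sampled.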


VI 


COSTS OF OLD- AND OF NEW-TYPE EXAMINATIONS


On account of the sample method of reviewing Regents examinations by 
the Department readers, and the impossibility of finding the exact number 
of papers which are actually reviewed, we are unable to compare the costs 
of new- and old-type Regents examinations used in this experiment. 
The annual reports of the Secretary of the College Entrance Examination 
Board include exhaustive statistics on the total and distributed costs of 
examinations in each subject matter. Since the Regents and College 
Board examinations are very similar in method and carefulness of con- 
struction and scoring, we believe that a comparison of College Board 
costs with new-type Regents costs will be a fair substitute for a compari- 
son between the costs of the new- and old-type Regents examinations. 
With this in view, Table 45, showing relative costs of old- and new-type 
examinations in French, Spanish, German and Physics, has been prepared. 
The lowest per-paper reading cost for the old-type modern language 
examinations shown in Table 45 is about eight times larger than the high- 
est per-paper cost of the new-type. Experience has shown that the 
cost of constructing new-type examinations would be about the same as 
that of constructing old-type papers under the present system, and that 
the cost of printing new-type papers will not average above one cent per 
paper over a series of years. 

The actual cost of the State Department reviewing of old-type papers is, 
of course, considerably lower than is indicated by the figures of Table 45. 
This is due to the fact that the State Department readers do not review 
every paper sent in by the schools; they read varying proportions of the 
papers from different schools and accept the school ratings on the re- 
mainder of the papers. The data presented in previous sections of this 
report indicate that the sample method of reviewing is a fundamental 
weakness in the Regents system. It therefore becomes of interest to 
learn what the cost would be if the State Department readers should give 
every Regents paper (e.g. in French) a genuine independent reading un- 
influenced by knowledge of the school rating. There were nearly 21,000 
papers in French II, III and IV in June, 1925. According to column 3 
in Table 45 the number of French papers read per hour is 2.4. Calling 
this average 3 for the sake of avoiding a fraction and dividing it into 21,000,
it appears that 7000 working hours or 1120 working days would be neces-
sary to score the Regents French examinations of June, 1925, if the latter
were as long as the College Entrance Board French papers. Putting it in 
terms of dollars, the cost of scoring 20,000 French papers at the lowest 
rate reported by the College Entrance Examination Board would be over 


TABLE 45 


RELATIVE COSTS OF SCORING OLD- AND NEW-TYPE EXAMINATIONS IN FRENCH,
SPANISH, GERMAN AND PHYSICS. COLUMN (1) SHOWS THE AVERAGE OF THE LOWEST
“PER-PAPER COSTS OF READING” REPORTED BY THE COLLEGE ENTRANCE EXAMINATION
BOARD FOR THE YEARS 1920, 1922 AND 1925 IN THE RESPECTIVE SUBJECT MATTERS.
COLUMN (2) SHOWS THE AVERAGES OF THE HIGHEST “PER-PAPER COSTS” REPORTED
BY THE BOARD FOR THE SAME YEARS. THESE FIGURES ARE FOR “READING” ONLY,
EXCLUSIVE OF ALL OTHER EXPENSES INCLUDING SUPERVISION. THE FIGURES FOR THE
NEW-TYPE SCORING COSTS IN COLUMN (4) INCLUDE TECHNICAL SUPERVISION, CHECKING
ALL SCORING AND COUNTING, AND TRANSFERRING SCORES TO THE REGENTS BOOKLETS.
OVER THREE-SEVENTHS OF THE COST OF SCORING THE NEW-TYPE MODERN LANGUAGE
EXAMINATIONS IS ACCOUNTED FOR BY PART III, THE GRAMMAR-COMPLETION TEST. THE
COST OF PRINTING NEW-TYPE PAPERS WILL NOT AVERAGE ABOVE ONE CENT PER PAPER
OVER A SERIES OF YEARS. THE COST OF CONSTRUCTING NEW-TYPE EXAMINATIONS
OVER A SERIES OF YEARS WILL BE ABOUT THE SAME AS THE COST OF CONSTRUCTING OLD-
TYPE UNDER PRESENT CONDITIONS. COLUMN (3) SHOWS THE AVERAGE NUMBER OF
PAPERS READ PER HOUR BY THE AVERAGE COLLEGE BOARD READER.


                                    COLLEGE ENTRANCE EXAMINATION BOARD              NEW-TYPE
SUBJECTS                         Per-paper cost of reading   Number of papers    Per-paper cost
                                    Low          High         read per hour        of scoring
Columns                             (1)          (2)              (3)                 (4)
French                             $0.71        $0.92             2.4                $0.07
Spanish                             0.54         1.07             3.0                 .07
German                              1 Ui L383 i                                       .07
Averages for Modern Languages      $0.78        $1.24             2.37               $0.07
Physics                            $0.43        $0.43             4.1                $0.03


$14,000; while a similar number of new-type French papers could be 
scored for $1400 or less. These differences are too large to need comment. 
The important finding of this report, however, is that regardless of rela- 
tive costs, the objective test furnishes the only known means by which 
the State Department can do that for which the Regents examinations 
were instituted, namely, furnish the state educational system with accu- 
rate measures of defined achievements expressed in comparable units and 
reckoned from stable and uniform standards. 
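
The cost estimate in this section rests on a few lines of arithmetic, repeated here as a minimal sketch in Python. The rates are those of Table 45 and the text; the working day of 6.25 hours is not stated in the report but is implied by its equating 7000 working hours with 1120 working days, and is therefore an assumption, as are the variable names. Costs are carried in cents to keep the arithmetic exact.

```python
french_papers = 21_000          # French II, III and IV papers, June 1925 (nearly)
papers_per_hour = 3             # the text rounds the College Board rate of 2.4 up to 3
hours_per_working_day = 6.25    # assumption implied by 7000 hours = 1120 days

reading_hours = french_papers / papers_per_hour
reading_days = reading_hours / hours_per_working_day

# At the unrounded rate of 2.4 papers per hour the reading would take longer still.
reading_hours_unrounded = french_papers / 2.4

# Scoring 20,000 papers at the per-paper rates for French, in cents.
papers_costed = 20_000
old_type_cents = 71             # lowest College Board reading cost for French
new_type_cents = 7              # new-type scoring cost
old_type_dollars = papers_costed * old_type_cents / 100
new_type_dollars = papers_costed * new_type_cents / 100

# Lowest old-type modern language rate (Spanish, 54 cents) against the new-type rate.
lowest_old_to_new_ratio = 54 / new_type_cents
```

These figures reproduce the 7000 hours, 1120 days, "over $14,000" and $1400 of the text, and the ratio of about 7.7 corresponds to the statement that the lowest old-type reading cost is about eight times the new-type cost.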


VII 


CONCLUSIONS AND RECOMMENDATIONS 


Space limitations do not allow us to describe the Regents system in 
detail. In earlier sections certain features of the Regents system have 
been mentioned, notably in connection with Table 21 on pages 133 to 
136 above. The background and significance of the present experiment 
will be more adequately appreciated if the general features of the Regents 
system are understood, especially since so many erroneous notions con- 
cerning the Regents examinations are entertained even by teachers within 
New York State. For this purpose we shall quote excerpts from an 
article appearing in “The University of the State of New York,” Bulletin 
to the Schools, Vol. 12, No. 2, Oct. 1, 1925. 

History and methods of the Regents examinations. — “Regents aca-
demic examinations were first given in 1878 in response to concerted action
by the academic principals of the State who urged the Board of Regents 
to establish a series of outside objective tests in secondary subjects for 
the purpose of measuring the work of their schools and of establishing 
proper standards of scholarship. Until 1906 these examinations were 
made in the Regents office, under the direction of the secretary of the 
Board, by a staff of inspectors and examiners, small at first but gradually 
enlarged as the schools grew and the use of the examinations became 
more widespread... . 

“In 1906. . . the Regents of the University created a State Examinations 
Board to serve as a liaison agency between the schools and the central 
administration of the University. .. . 

“This Board has twenty members: fifteen representing the colleges, 
superintendents and secondary schools, and five representing the Depart- 
ment. Ever since the day of its establishment each revision of our courses 
of study has been made upon its recommendation and by committees of 
its selection. Similarly, the examinations based upon these syllabuses 
have been prepared under its direction by committees of teachers chosen 
from various types of high schools and colleges in the State. 

“The making of Regents examinations is not, then, merely a matter 
of office routine. They represent the collective thinking of the teachers 
of the State. In each major field of study, such as history or English, a 
committee, usually of four members, of which only one is a member of
the State Department, prepares the question papers in that field. These
papers are therefore an expression of the experience and judgment of
teachers actually engaged in teaching or in supervision. On those ques- 
tion committees the Department has had only minority representation 
and has sought to have associated in the performance of these highly 
important duties the strongest men and women in the field of collegiate 
and of secondary education. 

“The papers thus prepared by the several committees are subjected to 
critical examination for form by the editor of examinations and are then 
submitted for final revision to a committee, usually of seven members, 
appointed by the State Examination Board. Regents question papers 
are therefore the product of thoughtful consideration, over a considerable 
period of time, by competent teachers experienced in the classroom or in 
supervision. 

“It is doubtful whether a more democratic scheme of coöperation in
the preparation of Regents examinations could be devised... . 

“There are two respects in which this scheme of statewide tests is 
more flexible and has created higher standards; namely, (1) the prepara- 
tion of papers by the committee system in which there is a full participa- 
tion of those actually engaged in instruction, and (2) the development both 
in the schools and in the Department of a standardized scheme for rating. 

“The board and its committees have not been unmindful of the 
new theories respecting educational measurements. It is constantly 
seeking methods of improving the examination system. It is the desire 
of the board and of the Department to adopt new measures whenever 
results warrant, as is evidenced by recent changes in the form of certain 
papers and by experiments with a new type of test for which greater 
accuracy and greater ease of rating are claimed. As illustrative of this 
may be cited certain changes in the form of papers set in English, the 
incorporation of a silent reading test, the inclusion of true-false, recogni- 
tion and completion questions as parts of papers in other subjects and 
recasting of algebra papers to provide a simpler mechanical arrangement 
for more rapid rating. 

“With the improvements in the form and in the content of examinations 
which have been brought about in recent years, there has come an equally 
important improvement in the method of rating answer papers. In most 
of the larger schools departmental organization is such that answer papers 
in a given subject group may be rated by the committee system. This 
in large measure eliminates the personal equation that sometimes makes 
the work of the individual reader unsatisfactory. The committee system 
also obtains in the work of the examining staff of the State Department 
particularly in the summer when the force of readers is largely augmented. 
The quality of the personnel engaged in reading papers at the State Edu- 
cation Building has been greatly strengthened. During the summer over 
160 experienced teachers are called into the service to assist the permanent
examiners. These are college graduates with evidence of sound judgment 
and with a minimum teaching experience of three years in the subject which 
they examine. During these years, by a careful process of selection and 
of testing there has been built up a staff of summer examiners of unusual 
merit representing in each field a group of the best teachers in the 
State... . 

“. . . In New York City the examinations are used for supervisory
purposes in order to reveal points of strength or weakness in instruction
both in the schools as a whole and in separate groups. . . .

“Regents examinations have played an important part in the educa- 
tional history of the State and while they have sometimes been adversely 
criticised, such criticism usually has been made by persons uninformed or 
biased. Generally these tests have been regarded as an important factor
in maintaining proper educational standards not only in the large
city high schools but in the small high schools in rural communities.”

The Regents system sound in principle. — To some readers who have 
carefully digested the data of this report, it may appear that the preceding 
excerpts fall somewhat short of giving a complete description or a fair 
appraisal of the actual workings of the Regents system. However the 
reader may judge the merits of the appraisal implied in these excerpts, 
the writer is convinced that, all things considered, the Regents system is 
fundamentally sound in principle. In preceding pages of this report we 
have exposed by means of objective evidence the inadequacies of the 
examinations used by the Regents and have presented a brief analysis 
of what seemed to be the main causes of these inadequacies. But it can- 
not be too strongly emphasized that these weaknesses arise from the 
particular forms of examinations used and from certain details of pro- 
cedure, none of which are necessarily inherent in the Regents system. 
In other words, we find no evidence from any source which calls in ques- 
tion the underlying principles of the Regents system. The purpose of 
examinations is to give accurate measures or meaningful descriptions of 
defined achievements and to express such measures in terms which can 
be understood by all competent educators, to the end that such informa- 
tion may be used constructively in guiding students educationally and in 
administering the educational system to the best advantage of all con- 
cerned. In order to fulfil this purpose it is necessary that all the educa- 
tional measurements within a given educational system be expressed in 
terms of comparable units reckoned from uniform and stable standards 
or points of reference. 

Destructive criticisms of Regents system not justified by facts. — The 
Regents system of examinations, like that of the College Entrance Exami- 
nation Board, has been severely criticized in some quarters as being wrong 
in principle, reactionary, autocratic, technically faulty and extravagantly
costly. Dr. Kruse’s report on “The State’s System of Examinations” 
in The Rural School Survey of New York State, Volume 1, Part 5, is a note- 
worthy example of extremely unfavorable criticism. 

The present report lends support to only one of Dr. Kruse’s conclusions, 
namely, that the examinations used are technically faulty. We find no 
evidence supporting his other conclusions, but on the contrary find that 
his recommendations looking towards abandonment or weakening of the 
Regents system are unsound and destructive rather than constructive. 
The old-type examinations are admittedly defective and unworthy both 
of the confidence reposed in them and of the system in which they are 
used. But the principle of uniform centrally administered examinations 
is sound and is indispensable to the wise administration of our educational 
system. Every charge which has been made against the Regents exami- 
nations holds with multiplied force against locally derived examinations 
which Dr. Kruse and others seem to recommend as a substitute. There 
can be no doubt about the existence of grave imperfections in the Regents 
and College Entrance Examination Board examinations; but to abandon 
them in favor of locally derived examinations would merely exaggerate 
existing imperfections and add many others without retaining any of the 
good features which the carefully derived and administered Regents and 
College Board examinations possess. 

Duties of examining agencies include research and information ser- 
vices. — Nor is there any danger essential in centrally administered systems 
of examinations so long as progress is not forestalled by the creation of 
vested interests and autocratic authority. It seems that sensitiveness to 
new conditions and to the demands of sound progress are assured in the 
Regents system by the distribution of authority and by the democratic 
composition of the various advisory boards and committees that are 
responsible for the construction and supervision of the Regents examina- 
tions in the various subject matters. To some advocates it may seem 
that the centralized examining agencies, including the Regents, have 
failed to keep pace with the progress which research scholars have made 
in recent years in improving instruments for the measurement of mental 
capacities and educational achievements. The first duty of executive 
officers in charge of examining agencies is to administer the available 
instruments which command the confidence of the educational system. 
But a fully coördinate duty is that of improving available instruments
and of evaluating improvements suggested by research workers every- 
where, and of keeping their educational constituencies adequately in- 
formed about such improvements. One aspect of this latter duty is a 
continuous and aggressive exposure, and a fearless publication, of what- 
ever defects exist in the examinations and procedures that are in force 
and that command the confidence of the public at any given time. An
official in charge of an examining agency who has failed either to ascertain 
and publish the weaknesses of the examinations administered, or who has 
failed to inform himself accurately and precisely about the scientific 
evidence back of suggested improvements in examining agencies has been 
guilty of a serious dereliction to duty. That such charges do not apply 
to the Regents system is evidenced by the large number of experiments, 
including the present one, that have been undertaken by the Regents 
examiners in recent years in efforts to verify and to profit by the values 
of new types of examinations. The experiment with which this report 
is concerned is only one of many bits of evidence to the effect that those 
in charge of the Regents examinations are willing to expose the weak- 
nesses of the examinations used and are anxious to adopt any proposed 
modifications, the values of which for the Regents system can be scientif- 
ically and fairly demonstrated. 

The time-serving conception of achievement is indefensible. — If the objec- 
tive and standardized tests are adopted for the modern languages it will 
be possible to define year-classes or credits in terms of actual achieve- 
ment rather than in terms of length of time spent “taking” modern 
language work. It is due to an exaggeration of the importance of clock 
time spent in class as a condition for — and measure of — achievement, 
and to a lack of appreciation of individual differences, that we have so 
long reposed confidence in a system of separate subjective examinations 
for each year-class; and that the incredible overlapping of classes and 
misplacement of students, exposed in this report, have continued so long 
undetected and uncorrected. The retention in the French II class of the 
300-odd students who in actual achievement are above the fourth-year 
average is a clear example of the evil workings of the old time-serving 
conception of education which has persistently ignored the fact of indi- 
vidual differences. It is only one of many examples that might be cited 
wherein the rules of examining agencies put a premium on stupidity and 
laziness and a penalty on intelligence and industry. Regents rules, like 
those of the College Entrance Examination Board, prejudge every mod- 
ern language student without trial by jury or any other of the amenities 
which the law accords to the worst criminals, and sentence them to a
fixed number of years in modern language classrooms; until the student 
has done time to the full extent of his sentence he is not even allowed to 
try the advanced examinations except by rare and very special dispensa- 
tions. Some readers may find it difficult to believe that such a situation 
could actually exist in the most advanced educational system of the 
twentieth century, but such worthy doubts are sadly dispelled when we 
read in the caption of the Regents French II examination¹ that, “The

1 See above pp. 199-200. 
minimum time requirement (for admission to the French Two-Year examination)
is five recitations a week for a school year in (a) first-year French,
(b) second-year French,” and similar statements from the captions of the 
third- and fourth-year examinations. This rule is no better, and in all 
its insidious effects is almost certainly worse, than a rule designed to 
regulate by chronological ages the heights of school desks and the sizes 
of shoes and of hats that school children might wear. In view of the 
near equality of the second-, third- and fourth-year old-type examinations 
as to difficulty, one might surmise that these rules are set up precisely 
because the examining authorities fear that some second-year students 
might surreptitiously pass a fourth-year examination if allowed to try it. 
If this motive has had any part in leading to the adoption of the rule, it 
is a remarkable acknowledgment of the weakness of the types of examinations
used by such agencies. It would be much more reasonable to give
extra promotion to students who learned more in less time than the aver- 
age. If a student is to study a modern language at all, he should be 
allowed to study it in a class of his peers, regardless of whether he has 
“had” five 40-minute periods per week for one or for four years. This 
can be done by defining both year-classes and individual student achieve- 
ments in terms of scores of such standardized, highly reliable, and valid 
tests as we have in the new-type parts of the Regents examinations used 
in this experiment. Thus the French classes might be defined as follows: 
at the beginning of the school year in September, 1926, the French II 
class should include only students who in June, 1926, secured scores be- 
tween 90 and 130 on the Regents new-type French examination; the 
French III class should include only students who secured scores between 
130 and 170; and the French IV class should begin with students whose 
scores range from 170 to 200. If these classes are divided into semester 
groups, the score limits should be as follows: French II A 90 to 110, 
French II B 110 to 130; French III A 130 to 150, French III B 150 to 170; 
French IV A 170 to 185; French IV B 185 to 200. Students scoring above 
200 in June, 1926, should receive credit in French IV, regardless of how 
long they have studied French; if they continue the study of French after 
thus receiving credit for French IV, they should take work more advanced 
than that in regular French IV classes. The score-limits here suggested 
are merely illustrative. The important thing is that if these definitions 
of classes should be adopted, every competent teacher everywhere and
at any time would understand them, so long as these or similar comparable 
tests were used. 
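
The illustrative score limits above can be written out as a simple lookup. The following is a minimal sketch in Python, not part of the original report. It assumes that a score falling exactly on a limit (130, for example) is placed in the higher group, since the text does not say how such scores are to be assigned. The function name `place_student` and the label for scores under 90 are hypothetical.

```python
# Illustrative score limits from the text (Regents new-type French
# examination, June 1926). Each entry is (lower limit, semester group).
# Assumption: a score exactly on a limit is placed in the higher group.
SEMESTER_LIMITS = [
    (90, "French II A"),
    (110, "French II B"),
    (130, "French III A"),
    (150, "French III B"),
    (170, "French IV A"),
    (185, "French IV B"),
]
CREDIT_SCORE = 200  # upper limit of French IV B; higher scores earn credit


def place_student(score):
    """Return the group for a June score under the illustrative limits."""
    if score > CREDIT_SCORE:
        return "credit for French IV; work beyond regular French IV"
    if score < SEMESTER_LIMITS[0][0]:
        # The text gives no class for scores under 90; this label is a placeholder.
        return "below the French II range"
    placement = SEMESTER_LIMITS[0][1]
    for lower, group in SEMESTER_LIMITS:
        if score >= lower:
            placement = group
    return placement


for june_score in (95, 130, 172, 200, 215):
    print(june_score, "->", place_student(june_score))
```

The point of the sketch is the one made in the text: the placement depends only on the measured score, never on the number of years the student has spent in class.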

Summary. — The general conclusions from the data of this experiment 
are that the new-type examinations are roughly twice as reliable and valid 
as the old-type examinations of equal time allowance; that the new-type 
examinations afford comparable measures for all classes in a given subject 
matter in the same and in different years and thus offer a means of eliminating
overlapping of classes and variations in local school standards to
a much greater extent than they are eliminated by the old-type Regents 
examinations; and that the new-type tests over a series of years will 
cost not more than 10% as much as old-type examinations, as adminis- 
tered and read by the College Entrance Examination Board, cost. 

The validity of the comparison of new- and old-type costs on the basis 
of College Entrance Board figures cannot be denied on the ground that
the actual cost of Regents reviewing under the present system is very 
much less than the cost reported by the College Entrance Examination 
Board, because the data reported above show conclusively that the sample 
method of reviewing cannot for a moment be considered as adequate. 
There is considerable evidence showing that the results of the College 
Entrance Board Examinations evince significant variations in standards 
and inaccuracies in spite of the careful and costly way in which the exami- 
nations are read. That such variations and inaccuracies are necessarily 
greater when the reading is less complete is self-evident. 

In this study we have discussed the weaknesses existing in the Regents 
examinations and procedures for purely constructive ends. Our funda- 
mental recommendation, therefore, is that the Regents system should 
take advantage of the values of the new-type without surrendering any 
of the real values of the old-type examinations. The committees charged 
with the task of constructing old-type questions should be provided with 
means for securing objective and experimental evidence on the difficulties 
and measurement values of old-type questions. It seems almost certain 
that the full values of the old-type question forms have never been fully 
exploited because of the reliance in their construction on the subjective 
opinion of scholars rather than upon experimental evidence. 

It seems fair to suggest that the new-type forms of questions be adopted 
for half of each Regents examination period. More detailed recommenda- 
tions obviously are not within the province of this report. How and by 
whom the new-type tests, if adopted, will be constructed; how, when and 
by whom they shall be scored; what weight shall be given to the new-type 
parts of the examinations; in how many and in what subject-matters 
shall the new-type forms be tried out first; what budgetary redistributions 
and what administrative reorganizations shall be necessary in the Exami- 
nations and Inspections Division of the State Education Department, — 
these are only a few of the important questions that would have to be 
faced if the new-type forms were adopted by the Regents. They are 
obviously questions which can be solved only by careful consideration on 
the part of the officers of the State Department of Education. 
SECOND SURVEY¹ OF MODERN LANGUAGE ACHIEVEMENT
IN THE JUNIOR HIGH SCHOOLS OF NEW YORK CITY 


Introduction. — In June, 1925, Form A of The American Council Beta
French test was administered to approximately 19,000 students of French, 
and Form A of The American Council Beta Spanish test to approximately 
6500 students of Spanish in the junior high schools of New York City. 
In June, 1926, Forms B of The American Council Beta French and Spanish 
tests were administered to 18,870 students of French and to 3940 students 
of Spanish in the junior high schools. The results of the 1925 tests have
been analyzed and reported above.² This paper relates to the 1926 test
results, and is offered as a preliminary report on a few of the most impor- 
tant aspects of the data at hand, so that the readers of the report on the 
1925 data may have the “follow-up” data in the same volume. 


TABLE 46


NUMBERS OF NEW YORK CITY JUNIOR HIGH SCHOOL STUDENTS TAKING THE
AMERICAN COUNCIL BETA TESTS IN FRENCH AND SPANISH IN 1925 AND 1926.


                   FRENCH                      SPANISH
CLASS      June, 1925   June, 1926     June, 1925   June, 1926
             Form A       Form B         Form A       Form B

8A            3,736        4,054          1,486        1,239
8B            3,041        4,072          1,332          945
9A            2,604        2,187            658          489
9B            2,417        1,723            688          426
RB            2,766        2,842            850          333
RC            2,415        2,350            683          275
RD            1,898        1,642            721          233
Totals       18,877       18,870          6,418        3,940


Number of junior high school students taking the tests in 1925 and in 
1926. — Table 46 shows the numbers of students taking each test in each 
year by semester classes. It would have been interesting to study the 
variations in numbers of students in the same classes in the two years, 


1 This study was made possible by grants from The Commonwealth Fund, and from The Carnegie 
Corporation through The Modern Foreign Language Study; by the coöperation of the research staff in
the office of Dean Hawkes of Columbia College; and by the coöperation of Mr. Jacob Greenberg, Director
of Foreign Languages in the Junior High Schools of New York City. Especial thanks are due, and heartily 
made, to Professors R. H. Fife, Algernon Coleman, V. A. C. Henmon, J. P. W. Crawford, C. M. Purin, 
and their associates in The Modern Foreign Language Study for support and helpful counsel. 

2 Pp. 3 to 1038, above. 
[Chart. Axis: Scores on American Council Beta French Tests.]

CHART 48. — Graphs of 1925 and 1926 French class medians, from Table 47, showing
equivalence of Forms A and B of the American Council Beta French Test. The near- 
rectilinearity of the lines indicates that both Forms of the test are adequately scaled in 
difficulty throughout the whole range of achievement, and that the units of the scale 
are approximately equal throughout the whole range of achievement in French in the 
junior high schools of New York City. The slight downward dip in the curves for the 
normal classes is easily accounted for by the retention of “repeaters” and by the pro- 
motion out of 8A of some students who should be eliminated or held in 8A. 
but the means for going into this important question were lacking. Why,
for example, should there be only 3000 students in French 8B in 1925 
and 4000 in 1926? Still more important, why should there be a falling 
off of enrolment in Spanish from 6400 to 3940? Such variations are too 


[Chart. Legend: Form A, 1925; Form B, 1926. Axis: Scores on American Council Beta Spanish Tests.]

CHART 49. — Graphs of 1925 and 1926 Spanish class medians, showing equivalence of
the American Council Beta Spanish Tests. See Chart 48. The apparent differences are 
very likely due to changes in the character of the student body in Spanish which occurred 
between June, 1925 and June, 1926: the Spanish enrolment in June, 1926, was less than 
62 per cent. as large as that in June, 1925. 


large to be due to chance and are therefore worthy of careful analysis. 
Whatever caused the landslide of nearly 40 per cent of the 1925 Spanish 
enrolment out of Spanish in 1926 is peculiar to Spanish since it did not 
affect the French enrolment. 
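
The enrolment figures quoted above follow directly from the totals of Table 46. A minimal sketch of the arithmetic in Python is given here for readers who wish to verify it; the function name `ratio_percent` is introduced only for illustration.

```python
# Totals from Table 46 (students tested in June of each year).
spanish_1925, spanish_1926 = 6418, 3940
french_1925, french_1926 = 18877, 18870


def ratio_percent(later, earlier):
    """Later enrolment expressed as a percentage of the earlier enrolment."""
    return 100.0 * later / earlier


spanish_ratio = ratio_percent(spanish_1926, spanish_1925)
french_ratio = ratio_percent(french_1926, french_1925)

print(f"Spanish 1926 as per cent of 1925: {spanish_ratio:.1f}")
print(f"Spanish decline, per cent:        {100.0 - spanish_ratio:.1f}")
print(f"French 1926 as per cent of 1925:  {french_ratio:.1f}")
```

The Spanish ratio comes out between 61 and 62 per cent, consistent with the statements that the 1926 enrolment was less than 62 per cent of the 1925 enrolment and that nearly 40 per cent was lost, while the French enrolment is practically unchanged.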

Equivalence of Forms A and B. — The American Council Beta tests in 
French and Spanish were originally constructed in two equivalent Forms, 
A and B. Unless the modern language tests that are used in our schools
are either equivalent or comparable, the measurements derived from 
different tests will be more or less meaningless and therefore of little 
use in the administration of modern language work and in research looking 
toward its improvement. The equivalence of Forms A and B of the 
American Council Beta tests was determined experimentally as explained 


TABLE 47


EQUIVALENCE OF FORMS A AND B OF AMERICAN COUNCIL BETA TESTS; SHOWING
MEDIANS AND QUARTILES OF EACH CLASS IN FRENCH AND SPANISH. THE NUMBER
OF CASES ON WHICH EACH MEDIAN IS BASED IS SHOWN IN TABLE 46.


                                 FRENCH                    SPANISH
                           Form A      Form B        Form A      Form B
CLASS                       1925        1926          1925        1926

8A    Lower Quartile        29.2        32.7          32.8        33.3
      Median                39.5        40.4          41.3        42.8
      Upper Quartile        50.0        49.2          49.7        54.2

8B    Lower Quartile        47.0        47.6          44.7        47.6
      Median                61.0        58.3          54.7        57.7
      Upper Quartile    [illegible]     70.5          66.7        70.0

9A    Lower Quartile        70.5        67.0          60.5        58.3
      Median                87.3        83.6          71.8        72.0
      Upper Quartile       105.5       100.0          86.0        86.8

9B    Lower Quartile       110.2        99.6          86.4        82.5
      Median               130.0       123.8         109.1       103.1
      Upper Quartile       148.0       140.7         136.3       126.0

RB    Lower Quartile        40.0        42.4          35.0        40.0
      Median                51.2        52.8          44.9        48.6
      Upper Quartile        64.3        64.1         5[?].2       57.4

RC    Lower Quartile        75.5    [illegible]       68.6        65.6
      Median                94.0        91.8          82.6        78.0
      Upper Quartile       111.5       110.0          98.0        92.0

RD    Lower Quartile       116.7       115.0          93.8        88.0
      Median               140.2       136.2         115.3       102.5
      Upper Quartile       160.5       154.0         134.2       122.5


above on pages 3 to 4. Table 47 and Charts 48 and 49 justify the claims 
of equivalence made in that report, since the differences for the French 
test are entirely negligible and the slightly larger differences for the 
Spanish test may easily be explained in terms of the small numbers of 
cases involved — especially in the norms for Form B. (We have already 
called attention to the fact that the 1926 Spanish enrolment is less than 
62% of the 1925 enrolment). 
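
The size of the differences between the two Forms can be read off the medians of Table 47. The sketch below, in Python, tabulates the Form B minus Form A difference for each class and a mean absolute difference for each language. The mean absolute difference is a summary chosen here for illustration and is not a statistic used in the report; the function names are likewise hypothetical.

```python
# Class medians transcribed from Table 47
# (Form A, June 1925; Form B, June 1926).
CLASSES = ["8A", "8B", "9A", "9B", "RB", "RC", "RD"]
MEDIANS = {
    "French": {
        "A": [39.5, 61.0, 87.3, 130.0, 51.2, 94.0, 140.2],
        "B": [40.4, 58.3, 83.6, 123.8, 52.8, 91.8, 136.2],
    },
    "Spanish": {
        "A": [41.3, 54.7, 71.8, 109.1, 44.9, 82.6, 115.3],
        "B": [42.8, 57.7, 72.0, 103.1, 48.6, 78.0, 102.5],
    },
}


def median_differences(language):
    """Form B median minus Form A median for each class, in score points."""
    forms = MEDIANS[language]
    return [round(b - a, 1) for a, b in zip(forms["A"], forms["B"])]


def mean_absolute_difference(language):
    """Average size of the class-median differences, ignoring sign."""
    diffs = median_differences(language)
    return sum(abs(d) for d in diffs) / len(diffs)


for language in MEDIANS:
    per_class = dict(zip(CLASSES, median_differences(language)))
    print(language, per_class, round(mean_absolute_difference(language), 2))
```

On these figures the French differences are smaller on the average than the Spanish, which agrees with the statement above that the larger Spanish differences accompany the smaller numbers of cases.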

In both charts the near-rectilinearity of the graphs of the norms is 
worthy of note. The Form A line for the rapid advancement French 
classes is practically straight throughout; and the Form B line parallels
it with only negligible deviations. Aside from confirming the almost exact 
equivalence of the tests these lines indicate that the tests are adequately 
scaled in difficulty for the whole range of achievement in the junior high
schools and that the units of the scale are approximately equal through- 
out the whole range of achievement. In both charts the downward dip 
of the curves for the normal classes is easily accounted for by the retention 
of “repeaters” and by the promotion out of 8A of some students who
should be eliminated from modern language work or held in 8A. The 
most important indication of Charts 48 and 49 is that the experimental 
method of constructing tests, if used with sufficient care, will produce 
tests capable of affording comparable measurements of modern language 
achievement. 

Influence of homogeneity of classes on achievement. — Two conditions 
disclosed by the 1925 data were the excessive heterogeneity of classes 
within a given school and the excessive variability of standards in different 
schools. It was hoped that at least some of the schools might use the 
results of the 1925 tests to reclassify the students of French and Spanish 
and thus secure homogeneous classes. If this had been done we could 
have compared the progress of homogeneous classes with that of hetero- 
geneous classes; but, as we shall see, so few of the schools found it possible 
to use the 1925 test results to any considerable extent for reclassifying 
purposes that no careful study of the influence of homogeneity on achieve- 
ment could be made. However, correlations between degree of homo- 
geneity and amount of achievement were calculated on the theory that, 
although the variations in homogeneity were apparently due entirely to 
chance, nevertheless some relation might be found between homogeneity 
and achievement. These correlations turned out to be almost exactly 
zero. This is by no means evidence that homogeneity is undesirable, 
for even if its influence were very great, it would have been — and in 
this case most probably was — masked by still more powerful influences 
inherent in each individual school situation. We shall illustrate con- 
cretely the effects of some of these hidden influences in succeeding sections. 

Constancy of progress. — Table 48 shows the correlations between the 
scores achieved by students on Form A in 1925 and on Form B in 1926 
of the French and Spanish tests. Under continuously ideal school condi- 
tions, it is plausible to assume that normally students will progress at a 
uniform rate, so that if a student in his first year achieves what is normal 
for the first-year students we should expect him in his second year to 
achieve what is normal for second-year students; also, if in his first year 
he got within the highest or lowest 10% of first-year students we would 
expect him, in his second year, to be in the highest or lowest 10% of 
second-year students. Since continuously ideal school conditions, as far 
as we know, do not exist, it is impossible to verify this plausible theory.
Common sense favors the theory, but however reasonable or unreasonable 
it may seem, Table 48 shows that a uniform rate of progress in individual 
students in modern language work is the exception rather than the rule 
in the junior high schools of New York City. 


TABLE 48


CONSTANCY OF PROGRESS IN MODERN LANGUAGE WORK. CORRELATIONS BETWEEN
SCORES ON FORM A TAKEN IN JUNE, 1925, AND SCORES ON FORM B, TAKEN IN JUNE,
1926. COLUMNS 1 AND 6 SHOW THE CORRELATION COEFFICIENTS; COLUMNS 2 AND 7
THE COEFFICIENTS OF ALIENATION, BY THE FORMULA $k=\sqrt{1-r^{2}}$; 3 AND 8 THE SIGMAS
OF THE FORM B 1926 SCORES; 4 AND 9 THE STANDARD ERRORS OF ESTIMATE OF FORM B
1926 SCORES FROM FORM A 1925 SCORES; AND COLUMNS 5 AND 10 SHOW THE NUMBER
OF CASES INVOLVED IN EACH CORRELATION IN COLUMNS 1 AND 6.


                            FRENCH                                      SPANISH
                1       2      3       4         5           6       7       8       9        10
             r25-26     k     σ26   σ26.25       N        r25-26     k      σ26   σ26.25      N

8A25-9A26     .326    .942    24     22.6   [illegible]    .553   .83[?]   25.0    21.0      372
8B25-9B26     .455    .890    29     26.0   [illegible]    .628    .777    34.0    26.0      342
RB25-RD26     .463    .885    28     25.0      1838        .389    .920    24.5    22.5  [illegible]

The correlations are all very low; the coefficients of alienation in col- 
umns 2 and 7 of Table 48 show that the relation between the achievements 
of students in successive years, as the classes were organized and taught 
in the sessions of 1924-1926, is very nearly one of pure chance. There 
are only slight differences between the three classes described at the ex- 
treme left of Table 48, but the order of magnitude of the correlations 
from highest to lowest, according to columns 4 and 9, is the same for both 
French and Spanish, — 8A-9A, RB-RD, 8B-9B. Columns 4 and 9 show 
the magnitude of the correlations for each language in terms of identical 
units, that is, in terms of standard error of estimate expressed in terms 
of the score-points of each test. If there is any tendency toward a uni- 
form rate of progress in individual students these correlations demonstrate 
that in the junior high schools of New York City that tendency is almost 
completely overridden by more powerful forces. 
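
The relation among the columns of Table 48 may be made explicit. The coefficient of alienation is $k=\sqrt{1-r^{2}}$, as stated in the caption, and the standard error of estimate is the sigma of the 1926 scores multiplied by $k$. The following minimal sketch in Python recomputes both for one row; the function names are introduced here for illustration only, and small departures from the printed entries are to be expected because the printed sigmas are rounded.

```python
import math


def coefficient_of_alienation(r):
    """k = sqrt(1 - r^2): how far a correlation r falls short of
    perfect prediction (k = 1 means no better than chance)."""
    return math.sqrt(1.0 - r * r)


def standard_error_of_estimate(r, sigma):
    """Standard error of predicting 1926 scores from 1925 scores:
    the sigma of the 1926 scores multiplied by k."""
    return sigma * coefficient_of_alienation(r)


# French 8B (1925) to 9B (1926), from Table 48: r = .455, sigma = 29.
r_french_8b = 0.455
sigma_french_9b = 29.0

k = coefficient_of_alienation(r_french_8b)
se = standard_error_of_estimate(r_french_8b, sigma_french_9b)
print(f"k = {k:.3f}, standard error of estimate = {se:.1f} score points")
```

A value of $k$ near 0.9 means that knowing a student's 1925 score narrows the range of his probable 1926 score by only about one-tenth, which is the sense in which the relation is described above as very nearly one of pure chance.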

That such powerful influences are due to chance variations in the peda- 
gogical conditions in individual schools is strongly indicated by the correla- 
tions of Table 49. Table 48 shows that in the junior high schools of 
New York City a uniform rate of achievement for individual students is 
the exception, and Table 49 shows that a uniform rate of achievement 
for classes composed of the same individual students is equally exceptional.
TABLE 49


CONSTANCY OF PROGRESS IN MODERN LANGUAGE WORK. CORRELATIONS BETWEEN
1925 MEDIANS AND 1926 MEDIANS OF INDIVIDUAL SCHOOL CLASSES MADE UP OF THE
SAME INDIVIDUAL STUDENTS.


                               CORRELATIONS BETWEEN MEDIANS OF
                             8A, 1925     8B, 1925     RB, 1925
                               and          and          and
                             9A, 1926     9B, 1926     RD, 1926

French                        0.361        0.222        0.494
No. of pairs of Medians        (31)         (27)         (27)
Spanish                       0.322        0.701        0.000
No. of pairs of Medians        (14)         (12)          (8)


The method of calculating the correlations of Table 49 will be illustrated 
by a consideration of the first coefficient in the table, 0.361: Among the 
junior high schools of New York City there were 31 that had 8A classes 
taking the examination in June, 1925, and 9A classes taking the examina- 
tion in June, 1926. These 8A and 9A class rolls were compared and a list 
was made for each of the 31 schools of the individual students common to 
both classes. These lists included both the Form A 1925 and Form B 
1926 scores, and the medians of the two sets of scores for each list of 
students were calculated. Thus we secured 31 pairs of medians, the 
correlation between which turned out to be 0.361. 
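
The procedure just described may be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the computation actually carried out for Table 49: the student identifiers, the scores, and the function names `matched_medians` and `pearson_r` are hypothetical, and since the text does not name the correlation formula, the ordinary product-moment coefficient is assumed.

```python
from math import sqrt
from statistics import median


def matched_medians(scores_1925, scores_1926):
    """Return the (1925, 1926) medians for students found on both class rolls.

    Each argument maps a (hypothetical) student identifier to a test score.
    """
    common = sorted(set(scores_1925) & set(scores_1926))
    if not common:
        raise ValueError("no students common to both class rolls")
    return (median(scores_1925[s] for s in common),
            median(scores_1926[s] for s in common))


def pearson_r(pairs):
    """Product-moment correlation between the two members of each pair."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Invented rolls for three schools (the survey itself had 31 such pairs).
# Students "x" and "y" appear on only one roll and are therefore dropped.
schools = [
    ({"a": 20, "b": 30, "c": 40, "x": 99}, {"a": 70, "b": 80, "c": 90}),
    ({"d": 35, "e": 45, "f": 55}, {"d": 60, "e": 70, "f": 80, "y": 5}),
    ({"g": 50, "h": 60, "i": 70}, {"g": 95, "h": 100, "i": 105}),
]
pairs = [matched_medians(r25, r26) for r25, r26 in schools]
r = pearson_r(pairs)
```

With one pair of medians per school, `pearson_r(pairs)` plays the part of a single coefficient in Table 49.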

The magnitude of the correlations in Table 49 is so small as to constitute 
a serious indictment of the system of educational guidance, or lack of it, 
which is apparently common to all departments of instruction in our edu- 
cational system. The average of the correlations of Table 49 is only 0.35, 
which means that the relation between the rate or amount of progress of a 
whole class in its first year of modern language work and the rate or amount 
of progress of the same group of individuals in the same school in the 
second year of its modern language work is very nearly a pure chance 
relation, under the present organization of modern language work in the 
junior high schools of New York City. The low correlations of Table 48 
might plausibly be explained as due to more or less accidental shifts in the 
interests of individual students, or to temporary or permanent variation 
in their ability to learn a modern language, or to any one of a dozen other 
things which might affect the achievement of an individual student; but it 
is highly improbable that any one or all of these chance influences could 
operate in the same direction on each and every individual in a class of 25 
to 150 students. The variable rate of achievement of whole classes can be 
explained only in terms of variations in the learning conditions in individual 
classrooms, such as the assignment of a class that has for the first year 
enjoyed a good teacher to a poor teacher, or the reverse; or the shift from a 
good textbook or a good method of instruction to a bad textbook, or bad 
method of instruction, or the reverse; or the merging of a hitherto very 
homogeneous class with a heterogeneous group of students, or the reverse. 
The operation of any one or of all of these possible influences, with the 
general background of extreme heterogeneity of classes and of incredible 
variations in the standards of individual schools, is sufficient to explain 
the failure of some schools to develop demonstrated ability in large groups 
of students, and the failure of other schools to make manifest the real 
ability of other large groups of students until the second year. 
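
The average of 0.35 quoted above can be checked directly from the six coefficients printed in Table 49. The short sketch below assumes a simple unweighted mean, which reproduces the quoted figure; the text does not say whether any weighting by number of classes was applied.

```python
from statistics import mean

# Coefficients as printed in Table 49.
table_49 = {
    ("French", "8A-9A"): 0.361,
    ("French", "8B-9B"): 0.222,
    ("French", "RB-RD"): 0.494,
    ("Spanish", "8A-9A"): 0.322,
    ("Spanish", "8B-9B"): 0.701,
    ("Spanish", "RB-RD"): 0.000,
}

# Unweighted mean of the six coefficients (an assumption; see above).
average_r = mean(list(table_49.values()))
print(round(average_r, 2))  # prints 0.35
```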


TABLE 50


NUMBERS AND PROPORTIONS OF STUDENTS WHO SCORED LOWER ON FORM B IN JUNE,
1926, THAN ON FORM A IN JUNE, 1925.


                      TOTAL NUMBER OF STUDENTS      SCORED LOWER IN 1926 THAN IN 1925,
                      WHO TOOK BOTH FORMS OF        HAVING BEEN REGULARLY IN CLASS
                      TEST AND WHOSE PAPERS         DURING THE INTERVENING YEAR AND
                      COULD BE MATCHED, THAT IS,    PROMOTED AT LEAST ONCE, E.G. 8A
                      WHO TOOK FORM A IN 1925       TO 8B.
                      IN, E.G. 8A, AND FORM B
                      IN 9A.                        Numbers         Per cents

French      8A-9A              1533                    33              2.2%
            8B-9B              1382                    12              0.8%
            RB-RD              1338                     3              0.2%
Spanish     8A-9A               372                    12              3.4%
            8B-9B               340                     6              2.2%
            RB-RD               208                     2              1.0%

Total French
and Spanish                    5173                    68              1.3%


It would be almost absurd to try to explain the small magnitude of the 
correlations of either Table 48 or Table 49 as due to any considerable 
extent to the unreliability of the American Council Beta tests. The 
report on the analysis of the 1925 data is sufficient to eliminate this 
argument.1 Incidentally the correlation scatter-diagrams of Table 48 enable 
us to secure additional evidence on the reliability of the tests. According 
to Table 50, only 68 out of 5173 students secured lower scores on Form B 
in 1926 than they secured on Form A in 1925, and of these 68 less than a 
dozen were lower by more than five or ten points in a total possible range 
of scores of 0 to 220. In comparison with similar data from equivalent 
tests that have been given a year apart, these figures are quite gratifying. 
The American Council Beta tests are not, of course, perfectly reliable, and 
some of the variations pointed out are undoubtedly due to this fact; but 
it is easy to conceive other reasons why at least 13 out of 1000 students 
tested a year apart might secure lower scores on the second than on the 
first test. 


1 Cf. above, pp. 3 to 103. 
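
The totals of Table 50, and the figure of 13 students in 1000, follow from the six rows of that table. The sketch below merely re-adds the printed counts; the dictionary layout and variable names are illustrative only.

```python
# (language, group): (students matched on both forms, students scoring lower in 1926)
# Counts as printed in Table 50.
table_50 = {
    ("French", "8A-9A"): (1533, 33),
    ("French", "8B-9B"): (1382, 12),
    ("French", "RB-RD"): (1338, 3),
    ("Spanish", "8A-9A"): (372, 12),
    ("Spanish", "8B-9B"): (340, 6),
    ("Spanish", "RB-RD"): (208, 2),
}

total_matched = sum(matched for matched, _ in table_50.values())
total_lower = sum(lower for _, lower in table_50.values())

# Students per thousand who scored lower on the second test.
per_thousand = 1000 * total_lower / total_matched
```

Here `total_matched` and `total_lower` agree with the 5173 and 68 of the table's last row, and `per_thousand` rounds to 13.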

Extent of reclassification.—As noted above it was hoped that the objec- 
tive and comparable measurements secured by means of the American 
Council Beta tests in 1925 would be used by some of the schools for pur- 
poses of making their classes homogeneous and of equating the standards 
for the various semester groups in all of the schools. Table 51 furnishes 
an indication of the extent to which homogeneity of classes was achieved. 

Table 51 shows that the interquartile ranges of the classes are exceed- 
ingly variable. One of the classes has an interquartile range of only 8 or 9 


TABLE 51


COMPARISON OF VARIABILITIES OF CLASSES IN INDIVIDUAL SCHOOLS. DISTRIBUTIONS
OF INTERQUARTILE RANGES OF FRENCH AND SPANISH CLASSES IN THE JUNIOR HIGH
SCHOOLS OF NEW YORK CITY IN JUNE, 1925, AND JUNE, 1926. THESE FIGURES ARE
BASED ONLY ON SCORES OF STUDENTS WHO WERE COMMON TO BOTH CLASSES INDICATED
ABOVE EACH PAIR OF COLUMNS. FORM A WAS USED IN 1925 AND FORM B IN 1926 OF
BOTH FRENCH AND SPANISH TESTS.


[Column headings: FRENCH 8A-9A, 8B-9B, RB-RD; SPANISH 8A-9A, 8B-9B, RB-RD. Rows: interquartile ranges in score-points. Table body not reproduced.]

points, and one has an interquartile range of over 50 points. Fifty points 
represents a greater range of achievement than the difference between the 
8A and 9A city-wide class medians, and is greater than the difference 
between the medians of any pair of successive semester classes in either 
language. Ten of the 82 individual school French classes in June, 1925, 
had interquartile ranges of less than 14 score-points, and 14 had inter- 
quartile ranges of 28 or more points. In June, 1926, only one French 
class had an interquartile range of less than 14 points and 31 individual 
school classes had interquartile ranges greater than 28 points. These 
figures are based only on scores of students who were in the same school 
and in corresponding classes in 1925 and 1926. This illustrates strikingly 
the need for continuous educational guidance of students. 
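
For readers who wish to apply the same yardstick to their own classes, the interquartile range and the two cut-offs used above (less than 14 score-points, and 28 or more) may be sketched as follows. The class scores are invented, the function names are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since the report does not state how its quartiles were computed.

```python
from statistics import quantiles


def interquartile_range(scores):
    """Distance between the first and third quartiles of a set of scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


def tally_homogeneity(classes, narrow_below=14, wide_from=28):
    """Count classes with IQR below 14, from 14 up to 28, and 28 or more."""
    counts = {"narrow": 0, "middle": 0, "wide": 0}
    for scores in classes.values():
        iqr = interquartile_range(scores)
        if iqr < narrow_below:
            counts["narrow"] += 1
        elif iqr >= wide_from:
            counts["wide"] += 1
        else:
            counts["middle"] += 1
    return counts


# Invented score lists for three classes.
classes = {
    "school X": [40, 42, 44, 46, 48],    # IQR 4: very homogeneous
    "school Y": [10, 20, 30, 40, 50],    # IQR 20
    "school Z": [0, 25, 50, 75, 100],    # IQR 50: very heterogeneous
}
counts = tally_homogeneity(classes)
```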

Table 51 shows that some of the junior high schools started the year with 
comparatively well-organized and homogeneous classes and kept their 
classes homogeneous throughout the year; while most of the schools started 
badly and became worse so far as homogeneity of classes is concerned. 
The methods used by the former should be carefully studied and made 
known to the latter schools. It cannot be too often urged that more atten- 
tion must be given to the task of learning students and of guiding them; 
we are now emphasizing teaching at the expense of good teaching because a 
large fraction of the teaching effort is misguided and wasted. In this 
connection the reader may review the argument of pages 99 to 103 of this 
volume. 

Variable standards of schools.—Table 52, showing distributions of the 
medians of the same individual school classes whose interquartile ranges 
are distributed in Table 51, further emphasizes the need for continuous 
constructive guidance of students on the basis of objective and comparable 
measurements of defined achievement. 

Table 52 gives us a picture of the persistence of the incredibly large 
differences in standards of individual schools disclosed in the report of the 
1925 test results.1 In spite of the availability of the accurate and 
comparable Form A measurements of June, 1925, two schools started the session of 
1925-1926 with classes which were called “8B” but which were actually 
above the 8B average. One of these two classes in September, 1925, was 
just average for 9A classes (see Table 52, column headed “8A-A-1925”). 
The third column in Table 52 shows that in three schools the classes which 
started as 9A classes in September, 1925, were actually below the average 
of 8A classes and that two were above the average of 9A classes. 

It is difficult to explain the persistence of these variations in the standards 
of individual schools, especially since most of the modern language teachers 
in the city were anxious to improve the situation. There can be little 
doubt, however, that the lack of flexibility in the existing administrative 


1 See above, pp. 3 to 38. 


TABLE 52


COMPARISON OF MEDIANS OF CLASSES IN INDIVIDUAL SCHOOLS. DISTRIBUTIONS
OF MEDIANS OF FRENCH AND SPANISH CLASSES IN THE JUNIOR HIGH SCHOOLS OF NEW
YORK CITY IN JUNE, 1925, AND JUNE, 1926. THESE FIGURES ARE BASED ONLY ON
SCORES OF STUDENTS WHO WERE COMMON TO BOTH CLASSES INDICATED ABOVE EACH
PAIR OF COLUMNS. FORM A WAS USED IN 1925 AND FORM B IN 1926 OF BOTH FRENCH
AND SPANISH TESTS.


[Column headings: FRENCH and SPANISH; 8A-9A, 8B-9B, RB-RD. Rows: score values. Table body not reproduced.]

organization is partly or largely responsible. The needs of modern 
education have outgrown their administrative and organizational habiliments. 
Within the last generation our public school system has become increasingly 
a matter of mass education without any corresponding increase or develop- 
ment of the administrative system, so far as the intelligent and effective 
guidance of the multiplied thousands of school children is concerned. 
School administration has been made comparatively efficient on the 
material side, but practically nothing has been achieved in the direction of 
keeping individual students continuously in proper relation to the cur- 
riculum. We know infinitely more about the buildings in which our children 
attend classes than we know about the children themselves; partly be- 
cause we have spent more money and time on learning about buildings 
than we have spent in learning about children. Recent developments in 
the field of educational measurement enable us to secure relatively accurate 
measures of defined achievement in comparable and meaningful terms, but 
our administrative system is not at present able to use this information in 
the constructive guidance of students. Tables 51 and 52 constitute concrete 
evidence of this fact. Efficiency of teaching has been too much associated 
with what might be called “minor tactics” of the classroom. School 
administrations, in addition to caring for the physical plant and announcing 
the major goals of education in general terms, must cease to lean so heavily 
upon the tactics of the individual classroom and must develop a strategy 
which will give the tacticians of the classroom a real opportunity. It is 
certainly poor strategy to put into a 9A class of 50 students ten who are 
below the 8A average and ten who are above the 9B average; yet, according 
to both the 1925 and 1926 data, this is very nearly the typical situation in 
the junior high schools of New York City. 
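
The kind of check that such a strategy would require is simple enough to sketch. The example below counts, in a single 9A class, the students falling below the 8A standard and above the 9B standard. It is an illustration only: the class scores are invented, the function name `misplaced` is hypothetical, and the approximate city-wide French medians of 40 (8A) and 130 (9B), taken from Charts 50 and 51, are assumed to stand in for the "averages" mentioned in the text.

```python
CITY_8A_MEDIAN = 40    # approximate city-wide 8A median (Chart 50)
CITY_9B_MEDIAN = 130   # approximate city-wide 9B median (Chart 51)


def misplaced(scores, lower=CITY_8A_MEDIAN, upper=CITY_9B_MEDIAN):
    """Return (number scoring below `lower`, number scoring above `upper`)."""
    below = sum(1 for s in scores if s < lower)
    above = sum(1 for s in scores if s > upper)
    return below, above


# An invented 9A class of 50 students resembling the situation described:
# ten below the 8A standard, thirty in between, ten above the 9B standard.
class_9a = (
    [25 + i for i in range(10)]         # 25 to 34
    + [60 + 2 * i for i in range(30)]   # 60 to 118
    + [135 + i for i in range(10)]      # 135 to 144
)
below, above = misplaced(class_9a)
```

In this invented class, two students in every five are out of place by at least a full year in one direction or the other.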

Progress of semester classes in individual schools in relation to initial 
achievement and to initial homogeneity.—In order to illustrate concretely 
some of the facts reported above in technical terms, Charts 50 to 52 have 
been prepared. Since these charts are exactly parallel in form, only the 
first one will be described in detail. 

The first thing to fix in mind in regard to Chart 50 is that the heavy 
vertical lines at 40 and at 87 represent the approximate city-wide 8A and 
9A medians. The heavy and light horizontal lines represent the inter- 
quartile ranges of the scores of the same students in 1925 and 1926, respec- 
tively. Thus the first pair of horizontal lines at the top of the chart 
relate to a school whose identity we have concealed under the number 1. 
In this school there were 58 8A students who took Form A in June, 1925, 
and who took Form B in 1926 as 9A students. The last pair of lines on the 
chart relate to 72 students from school 31 who took Form A in June, 1925, 
as 8A students and who took Form B in June, 1926, as 9A students. In 
every case students who were in the 8A class in 1925 and not in the 9A class 

Chart 50. Medians and interquartile ranges of American Council Beta French 
Test scores of identical groups of students who took Form A in June, 1925 as 8A students 
and Form B in June, 1926 as 9A students. 

Chart 51. Medians and interquartile ranges of American Council Beta French Test 
scores of identical groups of students who took Form A in June, 1925 as 8B students and 
Form B in June, 1926 as 9B students. The heavy vertical lines at 60 and 130 indicate the 
approximate city-wide 8B and 9B medians, respectively. This chart is similar to and 
should be read in the same way as Chart 50, q. v. 

Chart 52. Medians and interquartile ranges of American Council Beta French 
Test scores of identical groups of students who took Form A in June, 1925 as RB students 
and Form B in June, 1926 as RD students. The heavy vertical lines at 51 and 140 
indicate the approximate city-wide RB and RD medians, respectively. This chart is similar 
to and should be read in the same way as Chart 50, q. v. 

in 1926, or who were in the 9A 1926 class and not in the 8A 1925 class in 
the same school, were eliminated from consideration, so far as Chart 50 is 
concerned. This is also true of the 8B-9B and RB-RD groups treated in 
Charts 51 and 52. With rare exceptions this required the elimination of 
only a few students. The schools are arranged in the order of the magni- 
tudes of the Form A 1925 8A medians on Charts 50 to 52, inclusive. 

Chart 50 is fairly typical of all three of these charts. Comparing the 
first and last pairs of interquartile ranges on Chart 50, we find a very 
striking difference. In school 1 the 8A class in June, 1925, had a median 
score almost 20 points below the city-wide 8A median. During the 1925- 
1926 session the 52 students in this class became 9A students and progressed 
at a rate which placed their median in June, 1926, 16 score-points below 
the 9A city-wide median. Their rate of progress in the second and third 
semesters of their modern language work was consistent with that in their 
first semester. In school 31, 72 8A students secured a median score in 
June, 1925, just five or six points below the city-wide 9A median; during 
the 1925-1926 session these 72 students became 9A students, thus advan- 
cing a whole year in credit status but increasing their median score by less 
than three points! The school situation which produced this incredible 
result is certainly worthy of most careful investigation. Similarly the 
situation in school 4, which carried a sub-average 8A class of 23 students 
in one year to a median achievement only slightly below the 9B city-wide 
average, is worthy of careful study and emulation. 

In Charts 51 and 52 the schools are in the same order as in Chart 50. 
This fact enables the reader to secure a concrete picture of the confusion 
of standards and variations in efficiency within a single school. For 
example, in Chart 50 school 3 is below average and in Chart 51 it is distinctly 
above average. In Chart 50 school 23 started above average and ended 
below average, and in Chart 51 school 23 started below average and ended 
above average. Other striking variations in these charts are left to the 
interest of the reader. 

Similar charts have been prepared for corresponding groups of Spanish 
classes, but their indications are so closely parallel to those of Charts 50, 51, 
and 52 that, to conserve time and space, they are omitted from the present 
report. 

Summary.—The charts amply confirm four conclusions arrived at in 
preceding pages. 


1. There is little or no relation between the progress of classes composed of the 
same individual students in 1924-1925 and in 1925-1926. 

2. There is little or no relation between progress and degree of homogeneity of 
classes. 

3. The individual classroom situation is more potent in determining the progress 
of the class than any other influence that we can isolate. 

4. The heterogeneity of individual classroom situations, displayed by these 
charts, superimposed on the heterogeneity of initial class averages, plus the 
fact of the imperfect reliability of the tests, is more than enough to mask 
whatever relation there might be under ideal conditions between homo- 
geneity and achievement, on the one hand, and between initial and final 
achievement of individual students and of classes, on the other. In other 
words these charts are more than enough to account for the zero correlations 
reported above on page 327 and for the small magnitude of the correlations 
in Tables 48 and 49. Indeed, that there should be any positive correlation 
between the 1925 and 1926 achievements of individual students under 
the conditions disclosed by these charts, is a very striking manifestation of 
the strength of the tendency toward constancy of achievement rates in 
individual students. 


The importance of the question of the constancy of achievement rates in 
individual students is self-evident, because guidance is in effect prediction, 
and we have no way of judging the future but by the past of a student. So 
long as the real potentialities of individual students and of classes are 
masked by being made the sports of chance in the chaos of school and class- 
room situations uncovered by these charts, the effective educational 
guidance of students will remain a pious hope. Nothing short of a large 
program of research with the use of accurate and comparable measure- 
ments, and of continuous student-accounting, will avail in this situation. 

For example, many important questions on which our data would have 
shed light have not been considered in these pages. Among these are: 
What was the fate, during 1925-1926, of students who showed themselves 
in 1924-1925 to be exceptionally gifted or exceptionally deficient with 
regard to modern language achievement? What was the relation of 
progress in 1925-1926 to (a) kind of textbooks used, (b) variety of text- 
books used, (c) shifts in textbooks, (d) methods of instruction, (e) kind of 
vocabulary presented, and (f) to rate of presentation of vocabulary and 
grammatical materials? What was the relation of sex to amount and rate 
of achievement in 1925-1926? (The data were especially suited to the 
study of this question, since we could compare the progress of boys and of 
girls in terms of equal units, with other factors rendered constant, such 
as age, initial learning, etc.) What was the relation of age, and of language 
spoken at home, and of taking other languages, to achievement in modern 
language work? 

While an investigation of these and other similar questions cannot be 
undertaken in this report, the conclusions that have been arrived at are 
of far-reaching significance; they will illuminate other diagnostic studies 
that might be made of present and of future conditions in modern language 
teaching, and are at once suggestive of immediate corrective measures. 